Preface


The study of elliptic curves by algebraists, algebraic geometers and number theorists 
dates back to the middle of the nineteenth century. There now exists an extensive liter- 
ature that describes the beautiful and elegant properties of these marvelous objects. In 
1984, Hendrik Lenstra described an ingenious algorithm for factoring integers that re- 
lies on properties of elliptic curves. This discovery prompted researchers to investigate 
other applications of elliptic curves in cryptography and computational number theory. 

Public-key cryptography was conceived in 1976 by Whitfield Diffie and Martin Hell- 
man. The first practical realization followed in 1977 when Ron Rivest, Adi Shamir and 
Len Adleman proposed their now well-known RSA cryptosystem, in which security is 
based on the intractability of the integer factorization problem. Elliptic curve cryptog- 
raphy (ECC) was discovered in 1985 by Neal Koblitz and Victor Miller. Elliptic curve 
cryptographic schemes are public-key mechanisms that provide the same functional- 
ity as RSA schemes. However, their security is based on the hardness of a different 
problem, namely the elliptic curve discrete logarithm problem (ECDLP). Currently 
the best algorithms known to solve the ECDLP have fully exponential running time, 
in contrast to the subexponential-time algorithms known for the integer factorization 
problem. This means that a desired security level can be attained with significantly 
smaller keys in elliptic curve systems than is possible with their RSA counterparts. 
For example, it is generally accepted that a 160-bit elliptic curve key provides the same 
level of security as a 1024-bit RSA key. The advantages that can be gained from smaller 
key sizes include speed and efficient use of power, bandwidth, and storage. 


Audience This book is intended as a guide for security professionals, developers, and 
those interested in learning how elliptic curve cryptography can be deployed to secure 
applications. The presentation is targeted to a diverse audience, and generally assumes 
no more than an undergraduate degree in computer science, engineering, or mathemat- 
ics. The book was not written for theoreticians as is evident from the lack of proofs for 
mathematical statements. However, the breadth of coverage and the extensive surveys 
of the literature at the end of each chapter should make it a useful resource for the 
researcher. 


Overview The book has a strong focus on efficient methods for finite field arithmetic 
(Chapter 2) and elliptic curve arithmetic (Chapter 3). Next, Chapter 4 surveys the 
known attacks on the ECDLP, and describes the generation and validation of domain 
parameters and key pairs, and selected elliptic curve protocols for digital signature, 
public-key encryption and key establishment. We chose not to include the mathemat- 
ical details of the attacks on the ECDLP, or descriptions of algorithms for counting 
the points on an elliptic curve, because the relevant mathematics is quite sophisticated. 
(Presenting these topics in a readable and concise form is a formidable challenge post- 
poned for another day.) The choice of material in Chapters 2, 3 and 4 was heavily 
influenced by the contents of ECC standards that have been developed by accred- 
ited standards bodies, in particular the FIPS 186-2 standard for the Elliptic Curve 
Digital Signature Algorithm (ECDSA) developed by the U.S. government’s National
Institute of Standards and Technology (NIST). Chapter 5 details selected aspects of
efficient implementations in software and hardware, and also gives an introduction to 
side-channel attacks and their countermeasures. Although the coverage in Chapter 5 
is admittedly narrow, we hope that the treatment provides a glimpse of engineering 
considerations faced by software developers and hardware designers. 
CHAPTER 1


Introduction and Overview 


Elliptic curves have a rich and beautiful history, having been studied by mathematicians 
for over a hundred years. They have been used to solve a diverse range of problems. One 
example is the congruent number problem that asks for a classification of the positive 
integers occurring as the area of some right-angled triangle, the lengths of whose sides 
are rational numbers. Another example is proving Fermat’s Last Theorem which states 
that the equation x^n + y^n = z^n has no nonzero integer solutions for x, y and z when the
integer n is greater than 2. 


In 1985, Neal Koblitz and Victor Miller independently proposed using elliptic curves 
to design public-key cryptographic systems. Since then an abundance of research has 
been published on the security and efficient implementation of elliptic curve cryptogra- 
phy. In the late 1990’s, elliptic curve systems started receiving commercial acceptance 
when accredited standards organizations specified elliptic curve protocols, and private 
companies included these protocols in their security products. 


The purpose of this chapter is to explain the advantages of public-key cryptography 
over traditional symmetric-key cryptography, and, in particular, to expound the virtues 
of elliptic curve cryptography. The exposition is at an introductory level. We provide 
more detailed treatments of the security and efficient implementation of elliptic curve 
systems in subsequent chapters. 


We begin in §1.1 with a statement of the fundamental goals of cryptography and 
a description of the essential differences between symmetric-key cryptography and 
public-key cryptography. In §1.2, we review the RSA, discrete logarithm, and ellip- 
tic curve families of public-key systems. These systems are compared in §1.3 in which 
we explain the potential benefits offered by elliptic curve cryptography. A roadmap for 
the remainder of this book is provided in §1.4. Finally, §1.5 contains references to the 
cryptographic literature. 


1.1. Cryptography basics 


Cryptography is about the design and analysis of mathematical techniques that enable 
secure communications in the presence of malicious adversaries. 


Basic communications model 


In Figure 1.1, entities A (Alice) and B (Bob) are communicating over an unsecured 
channel. We assume that all communications take place in the presence of an adversary 
E (Eve) whose objective is to defeat any security services being provided to A and B. 
Figure 1.1. Basic communications model.


For example, A and B could be two people communicating over a cellular telephone 
network, and E is attempting to eavesdrop on their conversation. Or, A could be the
web browser of an individual A who is in the process of purchasing a product from 
an online store B represented by its web site B. In this scenario, the communications 
channel is the Internet. An adversary E could attempt to read the traffic from A to B 
thus learning A’s credit card information, or could attempt to impersonate either A or 
B in the transaction. As a third example, consider the situation where A is sending 
an email message to B over the Internet. An adversary E could attempt to read the 
message, modify selected portions, or impersonate A by sending her own messages 
to B. Finally, consider the scenario where A is a smart card that is in the process 
of authenticating its holder A to the mainframe computer B at the headquarters of a 
bank. Here, E could attempt to monitor the communications in order to obtain A’s 
account information, or could try to impersonate A in order to withdraw funds from 
A’s account. It should be evident from these examples that a communicating entity 
is not necessarily a human, but could be a computer, smart card, or software module 
acting on behalf of an individual or an organization such as a store or a bank. 


Security goals 
Careful examination of the scenarios outlined above reveals the following fundamental 
objectives of secure communications: 


1. Confidentiality: keeping data secret from all but those authorized to see 
it—messages sent by A to B should not be readable by E. 
2. Data integrity: ensuring that data has not been altered by unauthorized means—
B should be able to detect when data sent by A has been modified by E. 


3. Data origin authentication: corroborating the source of data—B should be able 
to verify that data purportedly sent by A indeed originated with A. 


4. Entity authentication: corroborating the identity of an entity—B should be 
convinced of the identity of the other communicating entity. 


5. Non-repudiation: preventing an entity from denying previous commitments or 
actions—when B receives a message purportedly from A, not only is B con- 
vinced that the message originated with A, but B can convince a neutral third 
party of this; thus A cannot deny having sent the message to B. 


Some applications may have other security objectives such as anonymity of the 
communicating entities or access control (the restriction of access to resources). 


Adversarial model 


In order to model realistic threats faced by A and B, we generally assume that the 
adversary E has considerable capabilities. In addition to being able to read all data 
transmitted over the channel, E can modify transmitted data and inject her own data. 
Moreover, E has significant computational resources at her disposal. Finally, com- 
plete descriptions of the communications protocols and any cryptographic mechanisms 
deployed (except for secret keying information) are known to E. The challenge to
cryptographers is to design mechanisms to secure the communications in the face of
such powerful adversaries.


Symmetric-key cryptography 


Cryptographic systems can be broadly divided into two kinds. In symmetric-key 
schemes, depicted in Figure 1.2(a), the communicating entities first agree upon keying 
material that is both secret and authentic. Subsequently, they may use a symmetric-key 
encryption scheme such as the Data Encryption Standard (DES), RC4, or the Advanced 
Encryption Standard (AES) to achieve confidentiality. They may also use a message au- 
thentication code (MAC) algorithm such as HMAC to achieve data integrity and data 
origin authentication. 

For example, if confidentiality were desired and the secret key shared by A and B
were k, then A would encrypt a plaintext message m using an encryption function ENC
and the key k and transmit the resulting ciphertext c = ENC_k(m) to B. On receiving c,
B would use the decryption function DEC and the same key k to recover m = DEC_k(c).
If data integrity and data origin authentication were desired, then A and B would first
agree upon a secret key k, after which A would compute the authentication tag t =
MAC_k(m) of a plaintext message m using a MAC algorithm and the key k. A would
then send m and t to B. On receiving m and t, B would use the MAC algorithm and
the same key k to recompute the tag t’ = MAC_k(m) of m and accept the message as
having originated from A if t = t’.
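
The tag computation and check described above can be sketched with HMAC from the Python standard library. This is a minimal illustration only: the key, the messages, and the choice of SHA-256 as the underlying hash are assumptions made for the example and do not come from the text.

```python
import hashlib
import hmac

def mac(key: bytes, message: bytes) -> bytes:
    """Compute the tag t = MAC_k(m), here with HMAC-SHA-256 (an assumed choice)."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute t' = MAC_k(m) and accept only if t = t'."""
    # compare_digest avoids leaking where the two tags first differ.
    return hmac.compare_digest(mac(key, message), tag)

k = b"secret key shared by A and B"    # illustrative key material
m = b"pay 100 to B"
t = mac(k, m)                          # A sends (m, t) to B

accepted = verify(k, m, t)                 # B checks the genuine message
tampered = verify(k, b"pay 900 to B", t)   # E modified m in transit
```

Here `accepted` is true and `tampered` is false, since E cannot produce a matching tag for the altered message without knowing k.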
Figure 1.2. Symmetric-key versus public-key cryptography.


Key distribution and management The major advantage of symmetric-key cryptog- 
raphy is high efficiency; however, there are significant drawbacks to these systems. 
One primary drawback is the so-called key distribution problem—the requirement for 
a channel that is both secret and authenticated for the distribution of keying material. 
In some applications, this distribution may be conveniently done by using a physi- 
cally secure channel such as a trusted courier. Another way is to use the services of an 
on-line trusted third-party who initially establishes secret keys with all the entities in 
a network and subsequently uses these keys to securely distribute keying material to 
communicating entities when required.[1] Solutions such as these may be well-suited to
environments where there is an accepted and trusted central authority, but are clearly 
impractical in applications such as email over the Internet. 

A second drawback is the key management problem—in a network of N entities, 
each entity may have to maintain different keying material with each of the other N — 1 
entities. This problem can be alleviated by using the services of an on-line trusted third- 
party that distributes keying material as required, thereby reducing the need for entities 
to securely store multiple keys. Again, however, such solutions are not practical in 
some scenarios. Finally, since keying material is shared between two (or more) entities, 
symmetric-key techniques cannot be used to devise elegant digital signature schemes 
that provide non-repudiation services. This is because it is impossible to distinguish 
between the actions taken by the different holders of a secret key.[2]


[1] This approach of using a centralized third party to distribute keys for symmetric-key
algorithms to parties as they are needed is used by the Kerberos network authentication
protocol for client/server applications.

[2] Digital signature schemes can be designed using symmetric-key techniques; however,
these schemes are generally impractical as they require the use of an on-line trusted
third party or new keying material for each signature.


Public-key cryptography


The notion of public-key cryptography, depicted in Figure 1.2(b), was introduced in
1975 by Diffie, Hellman and Merkle to address the aforementioned shortcomings
of symmetric-key cryptography. In contrast to symmetric-key schemes, public-key
schemes require only that the communicating entities exchange keying material that
is authentic (but not secret). Each entity selects a single key pair (e, d) consisting of a
public key e, and a related private key d (that the entity keeps secret). The keys have the
property that it is computationally infeasible to determine the private key solely from
knowledge of the public key.


Confidentiality If entity A wishes to send entity B a confidential message m, she
obtains an authentic copy of B’s public key e_B, and uses the encryption function ENC of a
public-key encryption scheme to compute the ciphertext c = ENC_{e_B}(m). A then
transmits c to B, who uses the decryption function DEC and his private key d_B to recover
the plaintext: m = DEC_{d_B}(c). The presumption is that an adversary with knowledge only
of e_B (but not of d_B) cannot decrypt c. Observe that there are no secrecy requirements
on e_B. It is essential only that A obtain an authentic copy of e_B; otherwise A would
encrypt m using the public key e_E of some entity E purporting to be B, and m would
be recoverable by E.


Non-repudiation Digital signature schemes can be devised for data origin
authentication and data integrity, and to facilitate the provision of non-repudiation services.
An entity A would use the signature generation algorithm SIGN of a digital signature
scheme and her private key d_A to compute the signature of a message: s = SIGN_{d_A}(m).
Upon receiving m and s, an entity B who has an authentic copy of A’s public key e_A
uses a signature verification algorithm to confirm that s was indeed generated from
m and d_A. Since d_A is presumably known only by A, B is assured that the message
did indeed originate from A. Moreover, since verification requires only the non-secret
quantities m and e_A, the signature s for m can also be verified by a third party who
could settle disputes if A denies having signed message m. Unlike handwritten
signatures, A’s signature s depends on the message m being signed, preventing a forger
from simply appending s to a different message m’ and claiming that A signed m’.
Even though there are no secrecy requirements on the public key e_A, it is essential
that verifiers should use an authentic copy of e_A when verifying signatures purportedly
generated by A.


In this way, public-key cryptography provides elegant solutions to the three problems 
with symmetric-key cryptography, namely key distribution, key management, and the 
provision of non-repudiation. It must be pointed out that, although the requirement 
for a secret channel for distributing keying material has been eliminated, implement- 
ing a public-key infrastructure (PKI) for distributing and managing public keys can 
be a formidable challenge in practice. Also, public-key operations are usually signifi- 
cantly slower than their symmetric-key counterparts. Hence, hybrid systems that benefit 
from the efficiency of symmetric-key algorithms and the functionality of public-key 
algorithms are often used. 

The next section introduces three families of public-key cryptographic systems. 
1.2. Public-key cryptography


In a public-key cryptographic scheme, a key pair is selected so that the problem of 
deriving the private key from the corresponding public key is equivalent to solving 
a computational problem that is believed to be intractable. Number-theoretic
problems whose intractability forms the basis for the security of commonly used public-key
schemes are:


1. The integer factorization problem, whose hardness is essential for the security of 
RSA public-key encryption and signature schemes. 


2. The discrete logarithm problem, whose hardness is essential for the security of 
the ElGamal public-key encryption and signature schemes and their variants such 
as the Digital Signature Algorithm (DSA). 


3. The elliptic curve discrete logarithm problem, whose hardness is essential for the 
security of all elliptic curve cryptographic schemes. 


In this section, we review the basic RSA, ElGamal, and elliptic curve public-key en- 
cryption and signature schemes. We emphasize that the schemes presented in this 
section are the basic “textbook” versions, and enhancements to the schemes are re- 
quired (such as padding plaintext messages with random strings prior to encryption) 
before they can be considered to offer adequate protection against real attacks. Never- 
theless, the basic schemes illustrate the main ideas behind the RSA, discrete logarithm, 
and elliptic curve families of public-key algorithms. Enhanced versions of the basic 
elliptic curve schemes are presented in Chapter 4. 


1.2.1 RSA systems 


RSA, named after its inventors Rivest, Shamir and Adleman, was proposed in 1977 
shortly after the discovery of public-key cryptography. 


RSA key generation 


An RSA key pair can be generated using Algorithm 1.1. The public key consists of a
pair of integers (n, e) where the RSA modulus n is a product of two randomly generated
(and secret) primes p and q of the same bitlength. The encryption exponent e is an
integer satisfying 1 < e < φ and gcd(e, φ) = 1 where φ = (p - 1)(q - 1). The private
key d, also called the decryption exponent, is the integer satisfying 1 < d < φ and
ed ≡ 1 (mod φ). It has been proven that the problem of determining the private key d
from the public key (n, e) is computationally equivalent to the problem of determining
the factors p and q of n; the latter is the integer factorization problem (IFP).
Algorithm 1.1 RSA key pair generation


INPUT: Security parameter l.
OUTPUT: RSA public key (n, e) and private key d.
1. Randomly select two primes p and q of the same bitlength l/2.
2. Compute n = pq and φ = (p - 1)(q - 1).
3. Select an arbitrary integer e with 1 < e < φ and gcd(e, φ) = 1.
4. Compute the integer d satisfying 1 < d < φ and ed ≡ 1 (mod φ).
5. Return(n, e, d).
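
The steps of Algorithm 1.1 can be sketched in Python as follows. This is a toy illustration under stated assumptions: the helper names (`is_prime`, `random_prime`, `rsa_keygen`) are hypothetical, primality is tested by trial division (usable only for very small l), the two primes are forced to be distinct, and the seeded `random.Random` generator is used only for reproducibility and is not cryptographically secure.

```python
import math
import random

def is_prime(n: int) -> bool:
    """Trial division; adequate only for the tiny primes used in this sketch."""
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True

def random_prime(bits: int, rng: random.Random) -> int:
    """Return a random prime with exactly `bits` bits."""
    while True:
        candidate = rng.randrange(1 << (bits - 1), 1 << bits)
        if is_prime(candidate):
            return candidate

def rsa_keygen(l: int, rng: random.Random):
    """Toy version of Algorithm 1.1; also returns phi so it can be inspected."""
    # Step 1: two primes of bitlength l/2 (kept distinct in this sketch).
    p = random_prime(l // 2, rng)
    q = random_prime(l // 2, rng)
    while q == p:
        q = random_prime(l // 2, rng)
    # Step 2: modulus and phi.
    n = p * q
    phi = (p - 1) * (q - 1)
    # Step 3: an arbitrary e with 1 < e < phi and gcd(e, phi) = 1.
    while True:
        e = rng.randrange(2, phi)
        if math.gcd(e, phi) == 1:
            break
    # Step 4: d is the inverse of e modulo phi (Python 3.8+).
    d = pow(e, -1, phi)
    return n, e, d, phi

n, e, d, phi = rsa_keygen(32, random.Random(2004))
```

A real implementation would draw randomness from a cryptographically secure source and use a probabilistic primality test on much larger candidates.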


RSA encryption scheme 


RSA encryption and signature schemes use the fact that

m^{ed} ≡ m (mod n)    (1.1)


for all integers m. The encryption and decryption procedures for the (basic) RSA
public-key encryption scheme are presented as Algorithms 1.2 and 1.3. Decryption
works because c^d ≡ (m^e)^d ≡ m (mod n), as derived from expression (1.1). The
security relies on the difficulty of computing the plaintext m from the ciphertext c =
m^e mod n and the public parameters n and e. This is the problem of finding eth roots
modulo n and is assumed (but has not been proven) to be as difficult as the integer
factorization problem.


Algorithm 1.2 Basic RSA encryption 


INPUT: RSA public key (n, e), plaintext m ∈ [0, n - 1].
OUTPUT: Ciphertext c.

1. Compute c = m^e mod n.

2. Return(c).


Algorithm 1.3 Basic RSA decryption 


INPUT: RSA public key (n, e), RSA private key d, ciphertext c. 
OUTPUT: Plaintext m. 

1. Compute m = c^d mod n.

2. Return(m). 
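
Algorithms 1.2 and 1.3 each reduce to a single modular exponentiation. The sketch below uses a small illustrative key (p = 61, q = 53, so n = 3233 and φ = 3120, with e = 17 and d = 2753); these values and the function names are assumptions chosen for the example and offer no security.

```python
# Toy RSA key (illustrative values): n = 61 * 53, and 17 * 2753 ≡ 1 (mod 3120).
N, E, D = 3233, 17, 2753

def rsa_encrypt(n: int, e: int, m: int) -> int:
    """Algorithm 1.2: c = m^e mod n for a plaintext m in [0, n - 1]."""
    if not 0 <= m < n:
        raise ValueError("plaintext must lie in [0, n - 1]")
    return pow(m, e, n)

def rsa_decrypt(n: int, d: int, c: int) -> int:
    """Algorithm 1.3: m = c^d mod n."""
    return pow(c, d, n)

c = rsa_encrypt(N, E, 65)
m = rsa_decrypt(N, D, c)   # recovers the plaintext 65
```

Because this is the basic "textbook" scheme, encrypting the same plaintext twice gives the same ciphertext; the padding enhancements mentioned in the text address this.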


RSA signature scheme 


The RSA signing and verifying procedures are shown in Algorithms 1.4 and 1.5. The
signer of a message m first computes its message digest h = H(m) using a cryptographic
hash function H, where h serves as a short fingerprint of m. Then, the signer
uses his private key d to compute the eth root s of h modulo n: s = h^d mod n. Note that
s^e ≡ h (mod n) from expression (1.1). The signer transmits the message m and its
signature s to a verifying party. This party then recomputes the message digest h = H(m),
recovers a message digest h’ = s^e mod n from s, and accepts the signature as being
valid for m provided that h = h’. The security relies on the inability of a forger (who
does not know the private key d) to compute eth roots modulo n.


Algorithm 1.4 Basic RSA signature generation 


INPUT: RSA public key (n, e), RSA private key d, message m. 
OUTPUT: Signature s. 

1. Compute h = H(m) where H is a hash function. 

2. Compute s = h^d mod n.

3. Return(s). 


Algorithm 1.5 Basic RSA signature verification 


INPUT: RSA public key (n, e), message m, signature s. 
OUTPUT: Acceptance or rejection of the signature. 
1. Compute h = H(m).
2. Compute h’ = s^e mod n.
3. If h = h’ then return(“Accept the signature”);
Else return(“Reject the signature”).
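
Algorithms 1.4 and 1.5 can be sketched in the same style. The toy key (n = 3233, e = 17, d = 2753) and the hash H, taken here to be SHA-256 reduced modulo n, are assumptions made for illustration; with such a small modulus digests collide easily, so this only demonstrates the mechanics.

```python
import hashlib

# Toy RSA key (illustrative values only): n = 61 * 53.
N, E, D = 3233, 17, 2753

def H(message: bytes, n: int) -> int:
    """Assumed hash for this sketch: SHA-256 interpreted as an integer, mod n."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def rsa_sign(n: int, d: int, message: bytes) -> int:
    """Algorithm 1.4: s = h^d mod n where h = H(m)."""
    return pow(H(message, n), d, n)

def rsa_verify(n: int, e: int, message: bytes, s: int) -> bool:
    """Algorithm 1.5: accept if and only if H(m) equals s^e mod n."""
    return H(message, n) == pow(s, e, n)

message = b"A agrees to pay B"
s = rsa_sign(N, D, message)
valid = rsa_verify(N, E, message, s)
# Changing s changes s^e mod n, so the altered signature is rejected.
forged = rsa_verify(N, E, message, (s + 1) % N)
```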


The computationally expensive step in any RSA operation is the modular exponentiation,
e.g., computing m^e mod n in encryption and c^d mod n in decryption. In order to
increase the efficiency of encryption and signature verification, one can select a small
encryption exponent e; in practice, e = 3 or e = 2^16 + 1 is commonly chosen. The
decryption exponent d is of the same bitlength as n. Thus, RSA encryption and signature
verification with small exponent e are significantly faster than RSA decryption and
signature generation.


1.2.2 Discrete logarithm systems 


The first discrete logarithm (DL) system was the key agreement protocol proposed 
by Diffie and Hellman in 1976. In 1984, ElGamal described DL public-key encryp- 
tion and signature schemes. Since then, many variants of these schemes have been 
proposed. Here we present the basic ElGamal public-key encryption scheme and the 
Digital Signature Algorithm (DSA). 
DL key generation


In discrete logarithm systems, a key pair is associated with a set of public domain
parameters (p, q, g). Here, p is a prime, q is a prime divisor of p - 1, and g ∈ [1, p - 1]
has order q (i.e., t = q is the smallest positive integer satisfying g^t ≡ 1 (mod p)).
A private key is an integer x that is selected uniformly at random from the interval
[1, q - 1] (this operation is denoted x ∈_R [1, q - 1]), and the corresponding public key
is y = g^x mod p. The problem of determining x given domain parameters (p, q, g) and
y is the discrete logarithm problem (DLP). We summarize the DL domain parameter
generation and key pair generation procedures in Algorithms 1.6 and 1.7, respectively.


Algorithm 1.6 DL domain parameter generation 


INPUT: Security parameters l, t.
OUTPUT: DL domain parameters (p, q, g).
1. Select a t-bit prime q and an l-bit prime p such that q divides p - 1.
2. Select an element g of order q:
2.1 Select arbitrary h ∈ [1, p - 1] and compute g = h^{(p-1)/q} mod p.
2.2 If g = 1 then go to step 2.1.
3. Return(p, q, g).


Algorithm 1.7 DL key pair generation 


INPUT: DL domain parameters (p,q, g). 
OUTPUT: Public key y and private key x. 
1. Select x ∈_R [1, q - 1].
2. Compute y = g^x mod p.
3. Return(y, x).
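
Algorithms 1.6 and 1.7 can be sketched as below for very small parameter sizes. The helper names, the trial-division primality test, the way p is forced to satisfy p ≡ 1 (mod q), and the seeded non-cryptographic `random.Random` generator are all assumptions made for this illustration.

```python
import math
import random

def is_prime(n: int) -> bool:
    """Trial division; adequate only for the tiny sizes used here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def dl_domain_parameters(l: int, t: int, rng: random.Random):
    """Toy version of Algorithm 1.6: an l-bit prime p, a t-bit prime q, g of order q."""
    # Step 1: a t-bit prime q, then an l-bit prime p with q dividing p - 1.
    while True:
        q = rng.randrange(1 << (t - 1), 1 << t)
        if is_prime(q):
            break
    while True:
        r = rng.randrange(1 << (l - 1), 1 << l)
        p = r - (r % q) + 1              # forces p ≡ 1 (mod q)
        if p.bit_length() == l and is_prime(p):
            break
    # Step 2: g = h^((p-1)/q) mod p has order q whenever g != 1.
    while True:
        h = rng.randrange(1, p)
        g = pow(h, (p - 1) // q, p)
        if g != 1:
            break
    return p, q, g

def dl_keygen(p: int, q: int, g: int, rng: random.Random):
    """Algorithm 1.7: private key x in [1, q - 1], public key y = g^x mod p."""
    x = rng.randrange(1, q)
    y = pow(g, x, p)
    return y, x

rng = random.Random(1)
p, q, g = dl_domain_parameters(16, 8, rng)
y, x = dl_keygen(p, q, g, rng)
```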


DL encryption scheme 


We present the encryption and decryption procedures for the (basic) ElGamal
public-key encryption scheme as Algorithms 1.8 and 1.9, respectively. If y is the intended
recipient’s public key, then a plaintext m is encrypted by multiplying it by y^k mod p
where k is randomly selected by the sender. The sender transmits this product c_2 =
m y^k mod p and also c_1 = g^k mod p to the recipient who uses her private key to
compute

c_1^x ≡ g^{kx} ≡ y^k (mod p)


and divides c_2 by this quantity to recover m. An eavesdropper who wishes to recover
m needs to calculate y^k mod p. This task of computing y^k mod p from the domain
parameters (p, q, g), y, and c_1 = g^k mod p is called the Diffie-Hellman problem (DHP).
The DHP is assumed (and has been proven in some cases) to be as difficult as the
discrete logarithm problem.


Algorithm 1.8 Basic ElGamal encryption 


INPUT: DL domain parameters (p, q, g), public key y, plaintext m ∈ [0, p - 1].
OUTPUT: Ciphertext (c_1, c_2).

1. Select k ∈_R [1, q - 1].

2. Compute c_1 = g^k mod p.

3. Compute c_2 = m · y^k mod p.

4. Return(c_1, c_2).


Algorithm 1.9 Basic ElGamal decryption 
INPUT: DL domain parameters (p, q, g), private key x, ciphertext (c_1, c_2).
OUTPUT: Plaintext m.

1. Compute m = c_2 · c_1^{-x} mod p.

2. Return(m).
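
Algorithms 1.8 and 1.9 can be sketched with toy domain parameters. The values p = 23, q = 11, g = 2 are assumptions chosen for illustration (2 has order 11 modulo 23), and the function names are hypothetical; parameters this small provide no security.

```python
import secrets

# Toy DL domain parameters (illustrative): q = 11 divides p - 1 = 22, and g = 2 has order 11.
P, Q, G = 23, 11, 2

def keygen():
    """Algorithm 1.7: private key x in [1, q - 1], public key y = g^x mod p."""
    x = 1 + secrets.randbelow(Q - 1)
    return pow(G, x, P), x

def encrypt(y: int, m: int):
    """Algorithm 1.8: return (c_1, c_2) = (g^k mod p, m * y^k mod p)."""
    if not 0 <= m < P:
        raise ValueError("plaintext must lie in [0, p - 1]")
    k = 1 + secrets.randbelow(Q - 1)       # fresh random k for every message
    c1 = pow(G, k, P)
    c2 = m * pow(y, k, P) % P
    return c1, c2

def decrypt(x: int, c1: int, c2: int) -> int:
    """Algorithm 1.9: m = c_2 * c_1^(-x) mod p (negative exponent needs Python 3.8+)."""
    return c2 * pow(c1, -x, P) % P

y, x = keygen()
c1, c2 = encrypt(y, 20)
recovered = decrypt(x, c1, c2)   # equals the plaintext 20
```

Unlike basic RSA, the ciphertext depends on the random k, so encrypting the same plaintext twice generally yields different ciphertexts.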


DL signature scheme 


The Digital Signature Algorithm (DSA) was proposed in 1991 by the U.S. National 
Institute of Standards and Technology (NIST) and was specified in a U.S. Government 
Federal Information Processing Standard (FIPS 186) called the Digital Signature Stan- 
dard (DSS). We summarize the signing and verifying procedures in Algorithms 1.10 
and 1.11, respectively. 

An entity A with private key x signs a message by selecting a random integer k from
the interval [1, q - 1], and computing T = g^k mod p, r = T mod q and


s = k^{-1}(h + xr) mod q    (1.2)


where h = H(m) is the message digest. A’s signature on m is the pair (r, s). To verify
the signature, an entity must check that (r, s) satisfies equation (1.2). Since the verifier
knows neither A’s private key x nor k, this equation cannot be directly verified. Note,
however, that equation (1.2) is equivalent to


k ≡ s^{-1}(h + xr) (mod q).    (1.3)

Raising g to both sides of (1.3) yields the equivalent congruence

T ≡ g^{h s^{-1}} y^{r s^{-1}} (mod p).


The verifier can therefore compute T and then check that r = T mod q.
Algorithm 1.10 DSA signature generation


INPUT: DL domain parameters (p, q, g), private key x, message m.
OUTPUT: Signature (r, s).
1. Select k ∈_R [1, q - 1].
2. Compute T = g^k mod p.
3. Compute r = T mod q. If r = 0 then go to step 1.
4. Compute h = H(m).
5. Compute s = k^{-1}(h + xr) mod q. If s = 0 then go to step 1.
6. Return(r, s).
Algorithm 1.11 DSA signature verification


INPUT: DL domain parameters (p, q, g), public key y, message m, signature (r, s).
OUTPUT: Acceptance or rejection of the signature.
1. Verify that r and s are integers in the interval [1, q - 1]. If any verification fails
then return(“Reject the signature”).
2. Compute h = H(m).
3. Compute w = s^{-1} mod q.
4. Compute u_1 = hw mod q and u_2 = rw mod q.
5. Compute T = g^{u_1} y^{u_2} mod p.
6. Compute r’ = T mod q.
7. If r = r’ then return(“Accept the signature”);
Else return(“Reject the signature”).
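
Algorithms 1.10 and 1.11 can be sketched as follows. The domain parameters p = 607, q = 101, g = 64 are toy values assumed for illustration (101 divides 606 and 64 = 2^6 has order 101 modulo 607), and the hash H is taken to be SHA-256 reduced modulo q; neither choice comes from the text, and sizes this small are trivially breakable.

```python
import hashlib
import secrets

# Toy DL domain parameters (illustrative): q divides p - 1 and g has order q.
P, Q, G = 607, 101, 64

def H(message: bytes) -> int:
    """Assumed hash for this sketch: SHA-256 as an integer, reduced mod q."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % Q

def keygen():
    """Private key x in [1, q - 1], public key y = g^x mod p."""
    x = 1 + secrets.randbelow(Q - 1)
    return pow(G, x, P), x

def sign(x: int, message: bytes):
    """Algorithm 1.10: retry with a new k whenever r = 0 or s = 0."""
    h = H(message)
    while True:
        k = 1 + secrets.randbelow(Q - 1)
        r = pow(G, k, P) % Q
        if r == 0:
            continue
        s = pow(k, -1, Q) * (h + x * r) % Q
        if s == 0:
            continue
        return r, s

def verify(y: int, message: bytes, signature) -> bool:
    """Algorithm 1.11: recompute T = g^u1 * y^u2 mod p and compare T mod q with r."""
    r, s = signature
    if not (1 <= r <= Q - 1 and 1 <= s <= Q - 1):
        return False
    w = pow(s, -1, Q)
    u1 = H(message) * w % Q
    u2 = r * w % Q
    T = pow(G, u1, P) * pow(y, u2, P) % P
    return T % Q == r

y, x = keygen()
signature = sign(x, b"message from A")
ok = verify(y, b"message from A", signature)
```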
1.2.3 Elliptic curve systems


The discrete logarithm systems presented in §1.2.2 can be described in the abstract 
setting of a finite cyclic group. We introduce some elementary concepts from group 
theory and explain this generalization. We then look at elliptic curve groups and show 
how they can be used to implement discrete logarithm systems. 


Groups 


An abelian group (G, *) consists of a set G with a binary operation * : G × G → G
satisfying the following properties:
(i) (Associativity) a * (b * c) = (a * b) * c for all a, b, c ∈ G.
(ii) (Existence of an identity) There exists an element e ∈ G such that a * e = e * a = a
for all a ∈ G.
(iii) (Existence of inverses) For each a ∈ G, there exists an element b ∈ G, called the
inverse of a, such that a * b = b * a = e.
(iv) (Commutativity) a * b = b * a for all a, b ∈ G.
The group operation is usually called addition (+) or multiplication (·). In the first
instance, the group is called an additive group, the (additive) identity element is usually
denoted by 0, and the (additive) inverse of a is denoted by -a. In the second instance,
the group is called a multiplicative group, the (multiplicative) identity element is
usually denoted by 1, and the (multiplicative) inverse of a is denoted by a^{-1}. The group is
finite if G is a finite set, in which case the number of elements in G is called the order
of G.


For example, let p be a prime number, and let F_p = {0, 1, 2, ..., p - 1} denote the set
of integers modulo p. Then (F_p, +), where the operation + is defined to be addition of
integers modulo p, is a finite additive group of order p with (additive) identity element
0. Also, (F_p^*, ·), where F_p^* denotes the nonzero elements in F_p and the operation · is
defined to be multiplication of integers modulo p, is a finite multiplicative group of
order p - 1 with (multiplicative) identity element 1. The triple (F_p, +, ·) is a finite field
(cf. §2.1), denoted more succinctly as F_p.

Now, if G is a finite multiplicative group of order n and g ∈ G, then the smallest
positive integer t such that g^t = 1 is called the order of g; such a t always exists and
is a divisor of n. The set ⟨g⟩ = {g^i : 0 ≤ i ≤ t - 1} of all powers of g is itself a group
under the same operation as G, and is called the cyclic subgroup of G generated by
g. Analogous statements are true if G is written additively. In that instance, the order
of g ∈ G is the smallest positive divisor t of n such that tg = 0, and ⟨g⟩ = {ig : 0 ≤
i ≤ t - 1}. Here, tg denotes the element obtained by adding t copies of g. If G has an
element g of order n, then G is said to be a cyclic group and g is called a generator of
G.


For example, with the DL domain parameters (p, q, g) defined as in §1.2.2, the
multiplicative group (F_p^*, ·) is a cyclic group of order p - 1. Furthermore, ⟨g⟩ is a cyclic
subgroup of order q.
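
The notions of element order and cyclic subgroup can be checked numerically for a small case. The sketch below assumes the illustrative parameters p = 23, q = 11, g = 2, which are not taken from the text, and the function names are hypothetical.

```python
def order(g: int, p: int) -> int:
    """Smallest positive t with g^t ≡ 1 (mod p), for g in F_p^*."""
    t, power = 1, g % p
    while power != 1:
        power = power * g % p
        t += 1
    return t

def cyclic_subgroup(g: int, p: int) -> list:
    """The set <g> = {g^i : 0 <= i <= t - 1}, listed in order of increasing i."""
    return [pow(g, i, p) for i in range(order(g, p))]

p, q, g = 23, 11, 2                      # illustrative toy parameters
subgroup = cyclic_subgroup(g, p)         # the 11 powers of 2 modulo 23
generator_order = order(5, p)            # 5 has order p - 1 = 22, so it generates F_23^*
```

For these values, every element order divides p - 1 = 22, and the subgroup generated by g = 2 has exactly q = 11 elements.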


Generalized discrete logarithm problem 


Suppose now that (G, ·) is a multiplicative cyclic group of order n with generator g.
Then we can describe the discrete logarithm systems presented in §1.2.2 in the setting
of G. For instance, the domain parameters are g and n, the private key is an integer
x selected randomly from the interval [1, n - 1], and the public key is y = g^x. The
problem of determining x given g, n and y is the discrete logarithm problem in G.


In order for a discrete logarithm system based on G to be efficient, fast algo- 
rithms should be known for computing the group operation. For security, the discrete 
logarithm problem in G should be intractable. 

Now, any two cyclic groups of the same order n are essentially the same; that is, 
they have the same structure even though the elements may be written differently. The 
different representations of group elements can result in algorithms of varying speeds 
for computing the group operation and for solving the discrete logarithm problem. 
The most popular groups for implementing discrete logarithm systems are the cyclic
subgroups of the multiplicative group of a finite field (discussed in §1.2.2), and cyclic 
subgroups of elliptic curve groups which we introduce next. 


Elliptic curve groups 


Let p be a prime number, and let $\mathbb{F}_p$ denote the field of integers modulo p. An elliptic
curve E over $\mathbb{F}_p$ is defined by an equation of the form

$$y^2 = x^3 + ax + b, \tag{1.4}$$

where $a, b \in \mathbb{F}_p$ satisfy $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$. A pair (x, y), where $x, y \in \mathbb{F}_p$, is a
point on the curve if (x, y) satisfies the equation (1.4). The point at infinity, denoted by
$\infty$, is also said to be on the curve. The set of all the points on E is denoted by $E(\mathbb{F}_p)$.
For example, if E is an elliptic curve over $\mathbb{F}_7$ with defining equation

$$y^2 = x^3 + 2x + 4,$$

then the points on E are

$$E(\mathbb{F}_7) = \{\infty, (0,2), (0,5), (1,0), (2,3), (2,4), (3,3), (3,4), (6,1), (6,6)\}.$$
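
The example over $\mathbb{F}_7$ can be checked by brute force. The sketch below simply tests every pair (x, y) against the curve equation; the function name is hypothetical, and this exhaustive approach is only sensible for very small p.

```python
def curve_points(a, b, p):
    """Affine points on y^2 = x^3 + ax + b over F_p, found by exhaustive search."""
    if (4 * a**3 + 27 * b**2) % p == 0:
        raise ValueError("4a^3 + 27b^2 = 0 (mod p): not an elliptic curve")
    return [(x, y) for x in range(p) for y in range(p)
            if (y * y - (x**3 + a * x + b)) % p == 0]


pts = curve_points(2, 4, 7)
print(pts)
print("number of points including infinity:", len(pts) + 1)
```

The point at infinity has no affine coordinates, so it is counted separately.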


Now, there is a well-known method for adding two elliptic curve points $(x_1, y_1)$ and
$(x_2, y_2)$ to produce a third point on the elliptic curve (see §3.1). The addition rule requires
a few arithmetic operations (addition, subtraction, multiplication and inversion)
in $\mathbb{F}_p$ with the coordinates $x_1, y_1, x_2, y_2$. With this addition rule, the set of points $E(\mathbb{F}_p)$
forms an (additive) abelian group with $\infty$ serving as the identity element. Cyclic subgroups
of such elliptic curve groups can now be used to implement discrete logarithm
systems.
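
The addition rule itself is deferred to §3.1, so the following Python sketch should be read as an illustration only. It uses the standard affine chord-and-tangent formulas for curves of the form (1.4), represents the point at infinity as `None` (a convention assumed here), and relies on `pow(x, -1, p)` for inversion, which requires Python 3.8 or later.

```python
def ec_add(P, Q, a, p):
    """Add points P and Q on y^2 = x^3 + ax + b over F_p; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


# Points on y^2 = x^3 + 2x + 4 over F_7
a, p = 2, 7
print(ec_add((0, 2), (0, 2), a, p))   # doubling
print(ec_add((0, 2), (2, 4), a, p))   # addition of distinct points
print(ec_add((2, 3), (2, 4), a, p))   # a point plus its negative
```

Each call uses one inversion and a handful of multiplications in $\mathbb{F}_p$, matching the operation count described above.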

We next illustrate the ideas behind elliptic curve cryptography by describing an 
elliptic curve analogue of the DL encryption scheme that was introduced in §1.2.2. 
Such elliptic curve systems, and also the elliptic curve analogue of the DSA signature 
scheme, are extensively studied in Chapter 4. 


Elliptic curve key generation 


Let E be an elliptic curve defined over a finite field $\mathbb{F}_p$. Let P be a point in $E(\mathbb{F}_p)$, and
suppose that P has prime order n. Then the cyclic subgroup of $E(\mathbb{F}_p)$ generated by P
is

$$\langle P \rangle = \{\infty, P, 2P, 3P, \ldots, (n-1)P\}.$$

The prime p, the equation of the elliptic curve E, and the point P and its order n, are
the public domain parameters. A private key is an integer d that is selected uniformly
at random from the interval $[1, n-1]$, and the corresponding public key is $Q = dP$.
The problem of determining d given the domain parameters and Q is the elliptic curve
discrete logarithm problem (ECDLP).


Algorithm 1.12 Elliptic curve key pair generation 


INPUT: Elliptic curve domain parameters (p, E, P, n).
OUTPUT: Public key Q and private key d.

1. Select $d \in_R [1, n-1]$.

2. Compute $Q = dP$.

3. Return (Q, d).
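
Algorithm 1.12 can be sketched in Python as follows. The toy domain parameters are assumptions made for illustration: the curve $y^2 = x^3 + 2x + 4$ over $\mathbb{F}_7$ with base point P = (2, 4), which has prime order n = 5. The helper names `ec_add` and `ec_mul` are hypothetical, the point at infinity is represented as `None`, and the double-and-add loop shown is just one simple way to compute dP.

```python
import secrets


def ec_add(P, Q, a, p):
    """Affine point addition on y^2 = x^3 + ax + b over F_p (None = infinity)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_mul(k, P, a, p):
    """Compute kP by binary double-and-add."""
    R = None
    while k > 0:
        if k & 1:
            R = ec_add(R, P, a, p)
        P = ec_add(P, P, a, p)
        k >>= 1
    return R


def ec_keygen(P, n, a, p):
    """Algorithm 1.12: select d uniformly from [1, n-1] and return (Q, d) with Q = dP."""
    d = 1 + secrets.randbelow(n - 1)
    return ec_mul(d, P, a, p), d


a, p, P, n = 2, 7, (2, 4), 5      # toy domain parameters (assumed)
Q, d = ec_keygen(P, n, a, p)
print("private key d =", d, " public key Q =", Q)
```

With parameters this small the ECDLP is trivial; the sketch only shows the data flow of the algorithm.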


Elliptic curve encryption scheme 


We present the encryption and decryption procedures for the elliptic curve analogue 
of the basic ElGamal encryption scheme as Algorithms 1.13 and 1.14, respectively. A 
plaintext m is first represented as a point M, and then encrypted by adding it to kQ 
where k is a randomly selected integer, and Q is the intended recipient’s public key. 
The sender transmits the points $C_1 = kP$ and $C_2 = M + kQ$ to the recipient who uses
her private key d to compute

$$dC_1 = d(kP) = k(dP) = kQ,$$

and thereafter recovers $M = C_2 - kQ$. An eavesdropper who wishes to recover M
needs to compute kQ. This task of computing kQ from the domain parameters, Q, and
$C_1 = kP$, is the elliptic curve analogue of the Diffie-Hellman problem.


Algorithm 1.13 Basic ElGamal elliptic curve encryption 


INPUT: Elliptic curve domain parameters (p, E, P, n), public key Q, plaintext m.
OUTPUT: Ciphertext $(C_1, C_2)$.

1. Represent the message m as a point M in $E(\mathbb{F}_p)$.

2. Select $k \in_R [1, n-1]$.

3. Compute $C_1 = kP$.

4. Compute $C_2 = M + kQ$.

5. Return $(C_1, C_2)$.


Algorithm 1.14 Basic ElGamal elliptic curve decryption 


INPUT: Domain parameters (p, E, P, n), private key d, ciphertext $(C_1, C_2)$.
OUTPUT: Plaintext m.

1. Compute $M = C_2 - dC_1$ and extract m from M.

2. Return (m).
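
Algorithms 1.13 and 1.14 can be exercised together with a short Python sketch. Several simplifications are assumed: the plaintext is taken to be a curve point M already (the text does not specify how m is mapped to M), the toy parameters are the curve $y^2 = x^3 + 2x + 4$ over $\mathbb{F}_7$ with P = (2, 4) of order n = 5, and the helper names are hypothetical.

```python
import secrets


def ec_add(P, Q, a, p):
    """Affine point addition on y^2 = x^3 + ax + b over F_p (None = infinity)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_neg(P, p):
    """Negative of a point: -(x, y) = (x, -y)."""
    return None if P is None else (P[0], (-P[1]) % p)


def ec_mul(k, P, a, p):
    """Compute kP by binary double-and-add."""
    R = None
    while k > 0:
        if k & 1:
            R = ec_add(R, P, a, p)
        P = ec_add(P, P, a, p)
        k >>= 1
    return R


def encrypt(M, Q, P, n, a, p):
    """Algorithm 1.13 with the plaintext already given as a point M."""
    k = 1 + secrets.randbelow(n - 1)              # k uniform in [1, n-1]
    return ec_mul(k, P, a, p), ec_add(M, ec_mul(k, Q, a, p), a, p)


def decrypt(C1, C2, d, a, p):
    """Algorithm 1.14: M = C2 - d*C1, using d*C1 = kQ."""
    return ec_add(C2, ec_neg(ec_mul(d, C1, a, p), p), a, p)


a, p, P, n = 2, 7, (2, 4), 5      # toy domain parameters (assumed)
d = 3                             # recipient's private key
Q = ec_mul(d, P, a, p)            # recipient's public key
M = (0, 2)                        # plaintext, already encoded as a curve point
C1, C2 = encrypt(M, Q, P, n, a, p)
print("round trip ok:", decrypt(C1, C2, d, a, p) == M)
```

Because k is fresh for every encryption, the same M generally yields different ciphertexts on different runs.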

There are several criteria that need to be considered when selecting a family of public-key
schemes for a specific application. The principal ones are:


1. Functionality. Does the public-key family provide the desired capabilities?

2. Security. What assurances are available that the protocols are secure?

3. Performance. For the desired level of security, do the protocols meet performance
objectives?

Other factors that may influence a decision include the existence of best-practice standards
developed by accredited standards organizations, the availability of commercial
cryptographic products, patent coverage, and the extent of existing deployments.

The RSA, DL and EC families introduced in §1.2 all provide the basic functional- 
ity expected of public-key cryptography—encryption, signatures, and key agreement. 
Over the years, researchers have developed techniques for designing and proving the 
security of RSA, DL and EC protocols under reasonable assumptions. The fundamental 
security issue that remains is the hardness of the underlying mathematical problem that 
is necessary for the security of all protocols in a public-key family—the integer factor- 
ization problem for RSA systems, the discrete logarithm problem for DL systems, and 
the elliptic curve discrete logarithm problem for EC systems. The perceived hardness 
of these problems directly impacts performance since it dictates the sizes of the domain 
and key parameters. That in turn affects the performance of the underlying arithmetic 
operations. 

In the remainder of this section, we summarize the state-of-the-art in algorithms 
for solving the integer factorization, discrete logarithm, and elliptic curve discrete 
logarithm problems. We then give estimates of parameter sizes providing equivalent 
levels of security for RSA, DL and EC systems. These comparisons illustrate the ap- 
peal of elliptic curve cryptography especially for applications that have high security 
requirements. 

We begin with an introduction to some relevant concepts from algorithm analysis. 


Measuring the efficiency of algorithms 


The efficiency of an algorithm is measured by the scarce resources it consumes. Typi- 
cally the measure used is time, but sometimes other measures such as space and number 
of processors are also considered. It is reasonable to expect that an algorithm consumes 
greater resources for larger inputs, and the efficiency of an algorithm is therefore de- 
scribed as a function of the input size. Here, the size is defined to be the number of bits 
needed to represent the input using a reasonable encoding. For example, an algorithm 
for factoring an integer n has input size $l = \lfloor \log_2 n \rfloor + 1$ bits.

Expressions for the running time of an algorithm are most useful if they are inde- 
pendent of any particular platform used to implement the algorithm. This is achieved 
by estimating the number of elementary operations (e.g., bit operations) executed. The 
(worst-case) running time of an algorithm is an upper bound, expressed as a function
of the input size, on the number of elementary steps executed by the algorithm. For example,
the method of trial division which factors an integer n by checking all possible
factors up to $\sqrt{n}$ has a running time of approximately $\sqrt{n} \approx 2^{l/2}$ division steps.
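
To see the $2^{l/2}$ estimate on a concrete input, the following Python sketch counts the division steps that trial division performs. The function name and the test value n = 10007 (a prime, so every candidate divisor up to $\sqrt{n}$ is tried) are choices made here for illustration.

```python
def trial_division(n):
    """Return (smallest nontrivial factor of n, or None if n is prime; division steps used)."""
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return d, steps
        d += 1
    return None, steps


n = 10007                      # a prime: the worst case for trial division
l = n.bit_length()             # input size l = floor(log2 n) + 1
factor, steps = trial_division(n)
print("l =", l, " factor =", factor, " steps =", steps, " 2^(l/2) =", 2 ** (l / 2))
```

Each additional two bits of input roughly doubles the step count, which is what makes the method exponential in l.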

It is often difficult to derive exact expressions for the running time of an algorithm. 
In these situations, it is convenient to use “big-O” notation. If f and g are two positive
real-valued functions defined on the positive integers, then we write $f = O(g)$ when
there exist positive constants c and L such that $f(l) \le c\,g(l)$ for all $l \ge L$. Informally,
this means that, asymptotically, f(l) grows no faster than g(l) to within a constant
multiple. Also useful is the “little-o” notation. We write $f = o(g)$ if for any positive
constant c there exists a constant L such that $f(l) \le c\,g(l)$ for $l \ge L$. Informally, this
means that f(l) becomes insignificant relative to g(l) for large values of l.

The accepted notion of an efficient algorithm is one whose running time is bounded 
by a polynomial in the input size. 


Definition 1.15 Let A be an algorithm whose input has bitlength l.

(i) A is a polynomial-time algorithm if its running time is $O(l^c)$ for some constant
$c > 0$.

(ii) A is an exponential-time algorithm if its running time is not of the form $O(l^c)$
for any $c > 0$.

(iii) A is a subexponential-time algorithm if its running time is $O(2^{o(l)})$, and A is not
a polynomial-time algorithm.

(iv) A is a fully-exponential-time algorithm if its running time is not of the form
$O(2^{o(l)})$.


It should be noted that a subexponential-time algorithm is also an exponential-time al- 
gorithm and, in particular, is not a polynomial-time algorithm. However, the running 
time of a subexponential-time algorithm does grow slower than that of a fully- 
exponential-time algorithm. Subexponential functions commonly arise when analyzing 
the running times of algorithms for factoring integers and finding discrete logarithms. 


Example 1.16 (subexponential-time algorithm) Let A be an algorithm whose input is
an integer n or a small set of integers modulo n (so the input size is $O(\log_2 n)$). If the
running time of A is of the form

$$L_n[\alpha, c] = O\left(e^{(c + o(1))(\log n)^{\alpha}(\log\log n)^{1-\alpha}}\right),$$

where c is a positive constant and $\alpha$ is a constant satisfying $0 < \alpha < 1$, then A is
a subexponential-time algorithm. Observe that if $\alpha = 0$ then $L_n[0, c]$ is a polynomial
expression in $\log_2 n$ (so A is a polynomial-time algorithm), while if $\alpha = 1$ then
$L_n[1, c]$ is a fully-exponential expression in $\log_2 n$ (so A is a fully-exponential-time algorithm).
Thus the parameter $\alpha$ is a good benchmark of how close a subexponential-time
algorithm is to being efficient (polynomial-time) or inefficient (fully-exponential-time).
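
The role of $\alpha$ can be explored numerically. The Python sketch below evaluates the base-2 logarithm of $e^{c(\log n)^{\alpha}(\log\log n)^{1-\alpha}}$ for $n = 2^{\text{bits}}$. It assumes natural logarithms inside the exponent and drops the o(1) term, so the output is only a rough guide to growth, not an exact operation count; the function name is hypothetical.

```python
import math


def log2_L(bits, alpha, c):
    """log2 of exp(c * (ln n)^alpha * (ln ln n)^(1 - alpha)) for n = 2^bits, with o(1) dropped."""
    ln_n = bits * math.log(2)
    return c * ln_n**alpha * math.log(ln_n)**(1 - alpha) / math.log(2)


for bits in (512, 1024, 2048):
    row = [round(log2_L(bits, alpha, 1.0), 1) for alpha in (0.0, 1 / 3, 0.5, 1.0)]
    print(bits, row)
```

For alpha = 0 the value grows like the logarithm of the bitlength (polynomial time), and for alpha = 1 it grows linearly in the bitlength (fully exponential time); intermediate values of alpha fall between the two.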
Solving integer factorization and discrete logarithm problems


We briefly survey the state of the art in algorithms for the integer factorization, discrete
logarithm, and elliptic curve discrete logarithm problems.


Algorithms for the integer factorization problem Recall that an instance of the integer
factorization problem is an integer n that is the product of two l/2-bit primes; the
input size is O(l) bits. The fastest algorithm known for factoring such n is the Number
Field Sieve (NFS) which has a subexponential expected running time of

$$L_n[\tfrac{1}{3}, 1.923]. \tag{1.5}$$


The NFS has two stages: a sieving stage where certain relations are collected, and a 
matrix stage where a large sparse system of linear equations is solved. The sieving 
stage is easy to parallelize, and can be executed on a collection of workstations on the 
Internet. However, in order for the sieving to be efficient, each workstation should have 
a large amount of main memory. The matrix stage is not so easy to parallelize, since 
the individual processors frequently need to communicate with one another. This stage 
is more effectively executed on a single massively parallel machine, than on a loosely 
coupled network of workstations. 

As of 2003, the largest RSA modulus factored with the NFS was a 530-bit (160- 
decimal digit) number. 


Algorithms for the discrete logarithm problem Recall that the discrete logarithm
problem has parameters p and q where p is an l-bit prime and q is a t-bit prime divisor
of $p - 1$; the input size is O(l) bits. The fastest algorithms known for solving the discrete
logarithm problem are the Number Field Sieve (NFS) which has a subexponential
expected running time of

$$L_p[\tfrac{1}{3}, 1.923], \tag{1.6}$$

and Pollard’s rho algorithm which has an expected running time of

$$\sqrt{\frac{\pi q}{2}}. \tag{1.7}$$

The comments made above for the NFS for integer factorization also apply to the NFS
for computing discrete logarithms. Pollard’s rho algorithm can be easily parallelized
so that the individual processors do not have to communicate with each other and only
occasionally communicate with a central processor. In addition, the algorithm has only
very small storage and main memory requirements.

The method of choice for solving a given instance of the DLP depends on the sizes
of the parameters p and q, which in turn determine which of the expressions (1.6)
and (1.7) represents the smaller computational effort. In practice, DL parameters are
selected so that the expected running times in expressions (1.6) and (1.7) are roughly
equal.
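
The text does not spell out Pollard’s rho algorithm at this point, so the following Python sketch is an illustration under stated assumptions rather than the method as presented later in the book. It uses a common textbook variant: a three-way partition of the group, Floyd cycle finding, and a restart from a random point if the collision is unusable. The toy parameters p = 10007, q = 5003 and g = 4 (an element of prime order q modulo p) are chosen here for illustration.

```python
import random


def pollard_rho_dlog(g, y, p, q, seed=0):
    """Find x in [0, q-1] with g^x = y (mod p), where g has prime order q."""
    rng = random.Random(seed)

    def step(z, a, b):
        # Invariant: z = g^a * y^b (mod p); the branch depends only on z.
        if z % 3 == 0:
            return (z * z) % p, (2 * a) % q, (2 * b) % q
        if z % 3 == 1:
            return (z * g) % p, (a + 1) % q, b
        return (z * y) % p, a, (b + 1) % q

    while True:
        a0, b0 = rng.randrange(q), rng.randrange(q)
        start = ((pow(g, a0, p) * pow(y, b0, p)) % p, a0, b0)
        slow = fast = start
        while True:
            slow = step(*slow)               # one step
            fast = step(*step(*fast))        # two steps
            if slow[0] == fast[0]:
                break
        # g^a1 * y^b1 = g^a2 * y^b2  implies  x = (a1 - a2) / (b2 - b1) mod q
        db = (fast[2] - slow[2]) % q
        if db != 0:
            return ((slow[1] - fast[1]) * pow(db, -1, q)) % q
        # db == 0 gives no information; retry from a new random starting point


p, q, g = 10007, 5003, 4         # toy parameters (assumed): q is prime and divides p - 1
x = 1234
y = pow(g, x, p)
print("recovered x:", pollard_rho_dlog(g, y, p, q))
```

Only the current pair of walk states is stored, which reflects the small memory requirement mentioned above. The expected number of steps grows like the square root of q.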

As of 2003, the largest instance of the DLP solved with the NFS is for a 397-bit 
(120-decimal digit) prime p. 


Algorithms for the elliptic curve discrete logarithm problem Recall that the
ECDLP asks for the integer $d \in [1, n-1]$ such that $Q = dP$, where n is a t-bit prime,
P is a point of order n on an elliptic curve defined over a finite field $\mathbb{F}_p$, and $Q \in \langle P \rangle$.
If we assume that $n \approx p$, as is usually the case in practice, then the input size is O(t)
bits. The fastest algorithm known for solving the ECDLP is Pollard’s rho algorithm
(cf. §4.1) which has an expected running time of

$$\sqrt{\frac{\pi n}{2}}. \tag{1.8}$$


The comments above concerning Pollard’s rho algorithm for solving the ordinary 
discrete logarithm problem also apply to solving the ECDLP. 

As of 2003, the largest ECDLP instance solved with Pollard’s rho algorithm is for 
an elliptic curve over a 109-bit prime field. 


Key size comparisons 


Estimates are given for parameter sizes providing comparable levels of security for 
RSA, DL, and EC systems, under the assumption that the algorithms mentioned above 
are indeed the best ones that exist for the integer factorization, discrete logarithm, and 
elliptic curve discrete logarithm problems. Thus, we do not account for fundamental 
breakthroughs in the future such as the discovery of significantly faster algorithms or 
the building of a large-scale quantum computer.³

If time is the only measure used for the efficiency of an algorithm, then the param- 
eter sizes providing equivalent security levels for RSA, DL and EC systems can be 
derived using the running times in expressions (1.5), (1.6), (1.7) and (1.8). The pa- 
rameter sizes, also called key sizes, that provide equivalent security levels for RSA, 
DL and EC systems as an 80-, 112-, 128-, 192- and 256-bit symmetric-key encryption 
scheme are listed in Table 1.1. By a security level of k bits we mean that the best algorithm
known for breaking the system takes approximately $2^k$ steps. These five specific
security levels were selected because they represent the amount of work required to per- 
form an exhaustive key search on the symmetric-key encryption schemes SKIPJACK, 
Triple-DES, AES-Small, AES-Medium, and AES-Large, respectively. 
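
The running-time expressions can be turned into rough bit-security estimates with a few lines of Python. This is a sketch under simplifying assumptions: the o(1) term in the NFS expression is dropped, natural logarithms are used inside the exponent, and a t-bit group order is approximated by $2^t$. The function names are hypothetical. Published key-size tables are based on more careful analyses, so the NFS figures printed here should not be expected to coincide with them exactly.

```python
import math


def rho_bits(t):
    """log2 of sqrt(pi * n / 2) for a t-bit group order n, approximated as n = 2^t."""
    return 0.5 * (t + math.log2(math.pi / 2))


def nfs_bits(l, c=1.923):
    """log2 of exp(c * (ln n)^(1/3) * (ln ln n)^(2/3)) for an l-bit modulus n, o(1) dropped."""
    ln_n = l * math.log(2)
    return c * ln_n ** (1 / 3) * math.log(ln_n) ** (2 / 3) / math.log(2)


for t in (160, 256, 384, 512):
    print("EC or DL subgroup order of", t, "bits: about 2^%.1f steps" % rho_bits(t))

for l in (1024, 2048, 3072, 8192, 15360):
    print("RSA or DL modulus of", l, "bits: about 2^%.1f steps" % nfs_bits(l))
```

The square-root cost of Pollard’s rho means the EC parameter needs about twice as many bits as the target security level, while the subexponential NFS forces the RSA and DL moduli to grow much faster.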

³ Efficient algorithms are known for solving the integer factorization, discrete logarithm, and elliptic curve
discrete logarithm problems on quantum computers (see the notes on page 196). However, it is still unknown
whether large-scale quantum computers can actually be built.


                                      Security level (bits)
                       80           112           128           192           256
                   (SKIPJACK)  (Triple-DES)   (AES-Small)  (AES-Medium)   (AES-Large)

DL parameter q,
EC parameter n        160           224           256           384           512

RSA modulus n,
DL modulus p         1024          2048          3072          8192         15360

Table 1.1. RSA, DL and EC key sizes for equivalent security levels. Bitlengths are given for
the DL parameter q and the EC parameter n, and the RSA modulus n and the DL modulus p,
respectively.


The key size comparisons in Table 1.1 are somewhat unsatisfactory in that they are
based only on the time required for the NFS and Pollard’s rho algorithms. In particular,
the NFS has several limiting factors including the amount of memory required for
the sieving stage, the size of the matrix, and the difficulty in parallelizing the matrix
stage, while these factors are not present in the analysis of Pollard’s rho algorithm. It
is possible to provide cost-equivalent key sizes that take into account the full cost of
the algorithms, that is, both the running time as well as the cost to build or otherwise
acquire the necessary hardware. However, such costs are difficult to estimate with a
reasonable degree of precision. Moreover, recent work has shown that the full cost
of the sieving and matrix stages can be significantly reduced by building customized
hardware. It therefore seems prudent to take a conservative approach and only use time
as the measure of efficiency for the NFS and Pollard’s rho algorithms.

The comparisons in Table 1.1 demonstrate that smaller parameters can be used in 
elliptic curve cryptography (ECC) than with RSA and DL systems at a given security 
level. The difference in parameter sizes is especially pronounced for higher security 
levels. The advantages that can be gained from smaller parameters include speed (faster 
computations) and smaller keys and certificates. In particular, private-key operations 
(such as signature generation and decryption) for ECC are many times more efficient 
than RSA and DL private-key operations. Public-key operations (such as signature ver- 
ification and encryption) for ECC are many times more efficient than for DL systems. 
Public-key operations for RSA are expected to be somewhat faster than for ECC if a
small encryption exponent e (such as $e = 3$ or $e = 2^{16} + 1$) is selected for RSA. The
advantages offered by ECC can be important in environments where processing power, 
storage, bandwidth, or power consumption is constrained. 


1.4 Roadmap 


Before implementing an elliptic curve system, several selections have to be made 
concerning the finite field, elliptic curve, and cryptographic protocol: 


1. a finite field, a representation for the field elements, and algorithms for
performing field arithmetic;

2. an elliptic curve, a representation for the elliptic curve points, and algorithms for
performing elliptic curve arithmetic; and

3. a protocol, and algorithms for performing protocol arithmetic.


There are many factors that can influence the choices made. All of these must be 
considered simultaneously in order to arrive at the best solution for a particular appli- 
cation. Relevant factors include security considerations, application platform (software 
or hardware), constraints of the particular computing environment (e.g., processing 
speed, code size (ROM), memory size (RAM), gate count, power consumption), and 
constraints of the particular communications environment (e.g., bandwidth, response 
time). 


Not surprisingly, it is difficult, if not impossible, to decide on a single “best” set of 
choices. For example, the optimal choices for a workstation application can be quite 
different from the optimal choices for a smart card application. The purpose of this 
book is to provide security practitioners with a comprehensive account of the vari- 
ous implementation and security considerations for elliptic curve cryptography, so that 
informed decisions of the most suitable options can be made for particular applications. 


The remainder of the book is organized as follows. Chapter 2 gives a brief intro- 
duction to finite fields. It then presents algorithms that are well-suited for software 
implementation of the arithmetic operations in three kinds of finite fields—prime fields, 
binary fields and optimal extension fields. 


Chapter 3 provides a brief introduction to elliptic curves, and presents different 
methods for representing points and for performing elliptic curve arithmetic. Also 
considered are techniques for accelerating the arithmetic on Koblitz curves and other 
elliptic curves admitting efficiently-computable endomorphisms. 


Chapter 4 describes elliptic curve protocols for digital signatures, public-key en- 
cryption and key establishment, and considers the generation and validation of domain 
parameters and key pairs. The state of the art in algorithms for solving the elliptic
curve discrete logarithm problem is also surveyed.


Chapter 5 considers selected engineering aspects of implementing elliptic curve 
cryptography in software and hardware. Also examined are side-channel attacks 
where an adversary exploits information leaked by cryptographic devices, including 
electromagnetic radiation, power consumption, and error messages. 


The appendices present some information that may be useful to implementors. Ap- 
pendix A presents specific examples of elliptic curve domain parameters that are 
suitable for cryptographic use. Appendix B summarizes the important standards that 
describe elliptic curve mechanisms. Appendix C lists selected software tools that are 
available for performing relevant number-theoretic calculations. 
1.5 Notes and further references


§1.1 

Popular books on modern cryptography include those of Schneier [409], Menezes, van 
Oorschot and Vanstone [319], Stinson [454], and Ferguson and Schneier [136]. These 
books describe the basic symmetric-key and public-key mechanisms outlined in §1.1 
including symmetric-key encryption schemes, MAC algorithms, public-key encryp- 
tion schemes, and digital signature schemes. Practical considerations with deploying 
public-key cryptography on a large scale are discussed in the books of Ford and Baum 
[145], Adams and Lloyd [2], and Housley and Polk [200]. 


§1.2 

The notion of public-key cryptography was introduced by Diffie and Hellman [121] and 
independently by Merkle [321]. A lucid account of its early history and development is 
given by Diffie [120]; for a popular narrative, see Levy’s book [290]. Diffie and Hell- 
man presented their key agreement algorithm using exponentiation in the multiplicative 
group of the integers modulo a prime, and described public-key encryption and digital 
signature schemes using generic trapdoor one-way functions. The first concrete real- 
ization of a public-key encryption scheme was the knapsack scheme of Merkle and 
Hellman [322]. This scheme, and its many variants that have been proposed, have been 
shown to be insecure. 


The RSA public-key encryption and signature schemes are due to Rivest, Shamir and 
Adleman [391]. 


ElGamal [131] was the first to propose public-key encryption and signature schemes
based on the hardness of the discrete logarithm problem. The Digital Signature Algorithm,
specified in FIPS 186 [139], was invented by Kravitz [268]. Smith and Skinner
[443], Gong and Harn [176], and Lenstra and Verheul [283] showed, respectively, how
the elements of the subgroup of order $p + 1$ of $\mathbb{F}_{p^2}^*$, the subgroup of order $p^2 + p + 1$
of $\mathbb{F}_{p^3}^*$, and the subgroup of order $p^2 - p + 1$ of $\mathbb{F}_{p^6}^*$, can be compactly represented. In
their systems, more commonly known as LUC, GH, and XTR, respectively, subgroup
elements have representations that are smaller than the representations of field elements
by factors of 2, 1.5 and 3, respectively.


Koblitz [250] and Miller [325] in 1985 independently proposed using the group of 
points on an elliptic curve defined over a finite field to devise discrete logarithm cryp- 
tographic schemes. Two books devoted to the study of elliptic curve cryptography 
are those of Menezes [313] and Blake, Seroussi and Smart [49] published in 1993 
and 1999, respectively. The books by Enge [132] and Washington [474] focus on the 
mathematics relevant to elliptic curve cryptography. 


Other applications of elliptic curves include the integer factorization algorithm of 
Lenstra [285] which is notable for its ability to quickly find any small prime factors 
of an integer, the primality proving algorithm of Goldwasser and Kilian [173], and the 
pseudorandom bit generators proposed by Kaliski [233]. Koyama, Maurer, Okamoto
and Vanstone [267] showed how elliptic curves defined over the integers modulo a 
composite integer n could be used to design RSA-like cryptographic schemes where 
the order of the elliptic curve group is the trapdoor. The hardness of factoring n is 
necessary for these schemes to be secure, and hence n should be the same bitlength 
as the modulus used in RSA systems. The work of several people including Kuro- 
sawa, Okada and Tsujii [273], Pinch [374], Kaliski [236] and Bleichenbacher [52] has 
shown that these elliptic curve analogues offer no significant advantages over their RSA 
counterparts. 


There have been many other proposals for using finite groups in discrete logarithm 
cryptographic schemes. These include the group of units of the integers modulo a com- 
posite integer by McCurley [310], the jacobian of a hyperelliptic curve over a finite field 
by Koblitz [251], the jacobian of a superelliptic curve over a finite field by Galbraith, 
Paulus and Smart [157], and the class group of an imaginary quadratic number field by 
Buchmann and Williams [80]. Buchmann and Williams [81] (see also Scheidler, Buch- 
mann and Williams [405]) showed how a real quadratic number field which yields a 
structure that is ‘almost’ a group can be used to design discrete logarithm schemes. 
Analogous structures for real quadratic congruence function fields were studied by 
Scheidler, Stein and Williams [406], and Müller, Vanstone and Zuccherato [336].


§1.3 

The number field sieve (NFS) for factoring integers was first proposed by Pollard [380], 
and is described in the book edited by Lenstra and Lenstra [280]. Cavallar et al. [87] 
report on their factorization using the NFS of a 512-bit RSA modulus. 


Pollard’s rho algorithm is due to Pollard [379]. The number field sieve (NFS) for com- 
puting discrete logarithms in prime fields was proposed by Gordon [178] and improved 
by Schirokauer [408]. Joux and Lercier [228] discuss further improvements that were 
used in their computation in 2001 of discrete logarithms in a 397-bit (120-decimal 
digit) prime field. The fastest algorithm for computing discrete logarithms in binary 
fields is due to Coppersmith [102]. The algorithm was implemented by Thomé [460] 
who succeeded in 2001 in computing logarithms in the 607-bit field $\mathbb{F}_{2^{607}}$.


The Certicom ECCp-109 challenge [88] was solved in 2002 by a team of contribu- 
tors led by Chris Monico. The method used was the parallelized version of Pollard’s 
rho algorithm as proposed by van Oorschot and Wiener [463]. The ECCp-109 chal- 
lenge asked for the solution of an ECDLP instance in an elliptic curve defined over a 
109-bit prime field. The effort took 549 days and had contributions from over 10,000 
workstations on the Internet. 


The equivalent key sizes for ECC and DSA parameters in Table 1.1 are from FIPS 186- 
2 [140] and NIST Special Publication 800-56 [342]. These comparisons are generally 
in agreement with those of Lenstra and Verheul [284] and Lenstra [279], who also 
consider cost-equivalent key sizes. Customized hardware designs for lowering the full 
cost of the matrix stage were proposed and analyzed by Bernstein [41], Wiener [481],
and Lenstra, Shamir, Tomlinson and Tromer [282]. Customized hardware designs for 
lowering the full cost of sieving were proposed by Shamir [421] (see also Lenstra 
and Shamir [281]), Geiselmann and Steinwandt [169], and Shamir and Tromer [423]. 
Shamir and Tromer [423] estimate that the sieving stage for a 1024-bit RSA modulus 
can be completed in less than a year by a machine that would cost about US $10 million 
to build, and that the matrix stage is easier. 


§1.4 

Readers can stay abreast of the latest developments in elliptic curve cryptography and 
related areas by studying the proceedings of the annual cryptography conferences 
including ASIACRYPT, CRYPTO, EUROCRYPT, INDOCRYPT, the Workshop on 
Cryptographic Hardware and Embedded Systems (CHES), the International Workshop 
on Practice and Theory in Public Key Cryptography (PKC), and the biennial Algorith- 
mic Number Theory Symposium (ANTS). The proceedings of all these conferences are 
published by Springer-Verlag in their Lecture Notes in Computer Science series, and 
are conveniently available online at http://link.springer.de/link/service/series/0558/. 
Another important repository for the latest research articles in cryptography is the 
Cryptology ePrint Archive website at http://eprint.iacr.org/. 
CHAPTER 2


Finite Field Arithmetic


The efficient implementation of finite field arithmetic is an important prerequisite in 
elliptic curve systems because curve operations are performed using arithmetic op- 
erations in the underlying field. §2.1 provides an informal introduction to the theory 
of finite fields. Three kinds of fields that are especially amenable for the efficient 
implementation of elliptic curve systems are prime fields, binary fields, and optimal 
extension fields. Efficient algorithms for software implementation of addition, subtrac- 
tion, multiplication and inversion in these fields are discussed at length in §2.2, §2.3, 
and §2.4, respectively. Hardware implementation is considered in §5.2 and chapter 
notes and references are provided in §2.5. 


2.1 Introduction to finite fields 


Fields are abstractions of familiar number systems (such as the rational numbers Q, the 
real numbers R, and the complex numbers C) and their essential properties. They consist
of a set F together with two operations, addition (denoted by $+$) and multiplication
(denoted by $\cdot$), that satisfy the usual arithmetic properties:

(i) $(F, +)$ is an abelian group with (additive) identity denoted by 0.

(ii) $(F \setminus \{0\}, \cdot)$ is an abelian group with (multiplicative) identity denoted by 1.

(iii) The distributive law holds: $(a + b) \cdot c = a \cdot c + b \cdot c$ for all $a, b, c \in F$.

If the set F is finite, then the field is said to be finite.


This section presents basic facts about finite fields. Other properties will be presented 
throughout the book as needed. 
Field operations


A field F is equipped with two operations, addition and multiplication. Subtraction of
field elements is defined in terms of addition: for $a, b \in F$, $a - b = a + (-b)$ where
$-b$ is the unique element in F such that $b + (-b) = 0$ ($-b$ is called the negative of b).
Similarly, division of field elements is defined in terms of multiplication: for $a, b \in F$
with $b \ne 0$, $a/b = a \cdot b^{-1}$ where $b^{-1}$ is the unique element in F such that $b \cdot b^{-1} = 1$.
($b^{-1}$ is called the inverse of b.)





Existence and uniqueness 


The order of a finite field is the number of elements in the field. There exists a finite
field F of order q if and only if q is a prime power, i.e., $q = p^m$ where p is a prime
number called the characteristic of F, and m is a positive integer. If $m = 1$, then F is
called a prime field. If $m \ge 2$, then F is called an extension field. For any prime power
q, there is essentially only one finite field of order q; informally, this means that any
two finite fields of order q are structurally the same except that the labeling used to
represent the field elements may be different (cf. Example 2.3). We say that any two
finite fields of order q are isomorphic and denote such a field by $\mathbb{F}_q$.


Prime fields 


Let p be a prime number. The integers modulo p, consisting of the integers
$\{0, 1, 2, \ldots, p-1\}$ with addition and multiplication performed modulo p, is a finite
field of order p. We shall denote this field by $\mathbb{F}_p$ and call p the modulus of $\mathbb{F}_p$. For any
integer a, a mod p shall denote the unique integer remainder r, $0 \le r \le p-1$, obtained
upon dividing a by p; this operation is called reduction modulo p.


Example 2.1 (prime field $\mathbb{F}_{29}$) The elements of $\mathbb{F}_{29}$ are $\{0, 1, 2, \ldots, 28\}$. The following
are some examples of arithmetic operations in $\mathbb{F}_{29}$.

(i) Addition: $17 + 20 = 8$ since $37 \bmod 29 = 8$.

(ii) Subtraction: $17 - 20 = 26$ since $-3 \bmod 29 = 26$.

(iii) Multiplication: $17 \cdot 20 = 21$ since $340 \bmod 29 = 21$.

(iv) Inversion: $17^{-1} = 12$ since $17 \cdot 12 \bmod 29 = 1$.
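
The four operations of Example 2.1 map directly onto Python integer arithmetic. The helper names below are hypothetical, and the inversion uses the three-argument `pow` with exponent -1, which requires Python 3.8 or later. This is a sketch for checking small examples, not one of the optimized algorithms discussed in §2.2.

```python
def fp_add(a, b, p):
    return (a + b) % p


def fp_sub(a, b, p):
    return (a - b) % p           # Python's % always returns a value in [0, p-1]


def fp_mul(a, b, p):
    return (a * b) % p


def fp_inv(a, p):
    if a % p == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return pow(a, -1, p)         # modular inverse (Python 3.8+)


p = 29
print(fp_add(17, 20, p), fp_sub(17, 20, p), fp_mul(17, 20, p), fp_inv(17, p))
```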


Binary fields 


Finite fields of order $2^m$ are called binary fields or characteristic-two finite fields. One
way to construct $\mathbb{F}_{2^m}$ is to use a polynomial basis representation. Here, the elements
of $\mathbb{F}_{2^m}$ are the binary polynomials (polynomials whose coefficients are in the field
$\mathbb{F}_2 = \{0, 1\}$) of degree at most $m - 1$:

$$\mathbb{F}_{2^m} = \{a_{m-1}z^{m-1} + a_{m-2}z^{m-2} + \cdots + a_2 z^2 + a_1 z + a_0 : a_i \in \{0, 1\}\}.$$

An irreducible binary polynomial f(z) of degree m is chosen (such a polynomial exists
for any m and can be efficiently found; see §A.1). Irreducibility of f(z) means that
f(z) cannot be factored as a product of binary polynomials each of degree less than
m. Addition of field elements is the usual addition of polynomials, with coefficient
arithmetic performed modulo 2. Multiplication of field elements is performed modulo
the reduction polynomial f(z). For any binary polynomial a(z), a(z) mod f(z) shall
denote the unique remainder polynomial r(z) of degree less than m obtained upon long
division of a(z) by f(z); this operation is called reduction modulo f(z).


Example 2.2 (binary field $\mathbb{F}_{2^4}$) The elements of $\mathbb{F}_{2^4}$ are the 16 binary polynomials of
degree at most 3:

    $0$        $z^2$          $z^3$          $z^3+z^2$
    $1$        $z^2+1$        $z^3+1$        $z^3+z^2+1$
    $z$        $z^2+z$        $z^3+z$        $z^3+z^2+z$
    $z+1$      $z^2+z+1$      $z^3+z+1$      $z^3+z^2+z+1$

The following are some examples of arithmetic operations in $\mathbb{F}_{2^4}$ with reduction
polynomial $f(z) = z^4 + z + 1$.

(i) Addition: $(z^3 + z^2 + 1) + (z^2 + z + 1) = z^3 + z$.
(ii) Subtraction: $(z^3 + z^2 + 1) - (z^2 + z + 1) = z^3 + z$. (Note that since $-1 = 1$ in $\mathbb{F}_2$,
we have $-a = a$ for all $a \in \mathbb{F}_{2^m}$.)
(iii) Multiplication: $(z^3 + z^2 + 1) \cdot (z^2 + z + 1) = z^2 + 1$ since

$$(z^3 + z^2 + 1) \cdot (z^2 + z + 1) = z^5 + z + 1$$

and

$$(z^5 + z + 1) \bmod (z^4 + z + 1) = z^2 + 1.$$

(iv) Inversion: $(z^3 + z^2 + 1)^{-1} = z^2$ since $(z^3 + z^2 + 1) \cdot z^2 \bmod (z^4 + z + 1) = 1$.
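The arithmetic of Example 2.2 can be checked in a few lines. The sketch below is an illustration only, under the assumption that a binary polynomial is stored as a Python integer whose bit $i$ is the coefficient of $z^i$; the function names `gf2m_add` and `gf2m_mul` are hypothetical and do not come from the text.

```python
def gf2m_add(a, b):
    """Addition (and subtraction) of binary polynomials: coefficientwise XOR."""
    return a ^ b

def gf2m_mul(a, b, f, m):
    """Shift-and-add multiplication of a and b modulo f, where deg f = m."""
    r = 0
    while b:
        if b & 1:          # current coefficient of b is 1: add a * z^i
            r ^= a
        b >>= 1
        a <<= 1            # a <- a * z
        if (a >> m) & 1:   # degree reached m: reduce modulo f
            a ^= f
    return r

f = 0b10011                # z^4 + z + 1
x = 0b1101                 # z^3 + z^2 + 1
y = 0b0111                 # z^2 + z + 1

print(bin(gf2m_add(x, y)))              # 0b1010 = z^3 + z
print(bin(gf2m_mul(x, y, f, 4)))        # 0b101  = z^2 + 1
print(bin(gf2m_mul(x, 0b0100, f, 4)))   # 0b1, so z^2 is the inverse of x
```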


Example 2.3 (isomorphic fields) There are three irreducible binary polynomials of degree 4,
namely $f_1(z) = z^4 + z + 1$, $f_2(z) = z^4 + z^3 + 1$ and $f_3(z) = z^4 + z^3 + z^2 + z + 1$.
Each of these reduction polynomials can be used to construct the field $\mathbb{F}_{2^4}$; let's call
the resulting fields $K_1$, $K_2$ and $K_3$. The field elements of $K_1$, $K_2$ and $K_3$ are the same
16 binary polynomials of degree at most 3. Superficially, these fields appear to be
different, e.g., $z^3 \cdot z = z + 1$ in $K_1$, $z^3 \cdot z = z^3 + 1$ in $K_2$, and $z^3 \cdot z = z^3 + z^2 + z + 1$ in
$K_3$. However, all fields of a given order are isomorphic; that is, the differences are
only in the labeling of the elements. An isomorphism between $K_1$ and $K_2$ may be
constructed by finding $c \in K_2$ such that $f_1(c) \equiv 0 \pmod{f_2}$ and then extending $z \mapsto c$
to an isomorphism $\varphi : K_1 \to K_2$; the choices for $c$ are $z^2 + z$, $z^2 + z + 1$, $z^3 + z^2$, and
$z^3 + z^2 + 1$.
Extension fields


The polynomial basis representation for binary fields can be generalized to all extension
fields as follows. Let $p$ be a prime and $m \geq 2$. Let $\mathbb{F}_p[z]$ denote the set of all
polynomials in the variable $z$ with coefficients from $\mathbb{F}_p$. Let $f(z)$, the reduction
polynomial, be an irreducible polynomial of degree $m$ in $\mathbb{F}_p[z]$ (such a polynomial exists
for any $p$ and $m$ and can be efficiently found; see §A.1). Irreducibility of $f(z)$ means
that $f(z)$ cannot be factored as a product of polynomials in $\mathbb{F}_p[z]$ each of degree less
than $m$. The elements of $\mathbb{F}_{p^m}$ are the polynomials in $\mathbb{F}_p[z]$ of degree at most $m - 1$:

$$\mathbb{F}_{p^m} = \{a_{m-1}z^{m-1} + a_{m-2}z^{m-2} + \cdots + a_2z^2 + a_1z + a_0 : a_i \in \mathbb{F}_p\}.$$

Addition of field elements is the usual addition of polynomials, with coefficient
arithmetic performed in $\mathbb{F}_p$. Multiplication of field elements is performed modulo the
polynomial $f(z)$.


Example 2.4 (an extension field) Let $p = 251$ and $m = 5$. The polynomial $f(z) = z^5 +
z^4 + 12z^3 + 9z^2 + 7$ is irreducible in $\mathbb{F}_{251}[z]$ and thus can serve as reduction polynomial
for the construction of $\mathbb{F}_{251^5}$, the finite field of order $251^5$. The elements of $\mathbb{F}_{251^5}$ are
the polynomials in $\mathbb{F}_{251}[z]$ of degree at most 4.
The following are some examples of arithmetic operations in $\mathbb{F}_{251^5}$. Let $a = 123z^4 +
76z^2 + 7z + 4$ and $b = 196z^4 + 12z^3 + 225z^2 + 76$.
(i) Addition: $a + b = 68z^4 + 12z^3 + 50z^2 + 7z + 80$.
(ii) Subtraction: $a - b = 178z^4 + 239z^3 + 102z^2 + 7z + 179$.
(iii) Multiplication: $a \cdot b = 117z^4 + 151z^3 + 117z^2 + 182z + 217$.
(iv) Inversion: $a^{-1} = 109z^4 + 111z^3 + 250z^2 + 98z + 85$.
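Parts (i) to (iii) of Example 2.4 can be reproduced with a short script. This is a sketch under stated assumptions rather than the book's method: polynomials are stored as coefficient lists from the constant term upward, and the names `poly_add`, `poly_sub` and `poly_mul` are invented for the illustration.

```python
P = 251
F = [7, 0, 9, 12, 1, 1]   # f(z) = z^5 + z^4 + 12z^3 + 9z^2 + 7, low degree first

def poly_add(a, b, p=P):
    return [(x + y) % p for x, y in zip(a, b)]

def poly_sub(a, b, p=P):
    return [(x - y) % p for x, y in zip(a, b)]

def poly_mul(a, b, f=F, p=P):
    """Schoolbook product followed by reduction modulo the monic polynomial f."""
    m = len(f) - 1
    prod = [0] * (2 * m - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            prod[i + j] = (prod[i + j] + x * y) % p
    for k in range(len(prod) - 1, m - 1, -1):   # clear degrees 2m-2 down to m
        c = prod[k]
        for j in range(m + 1):
            prod[k - m + j] = (prod[k - m + j] - c * f[j]) % p
    return prod[:m]

a = [4, 7, 76, 0, 123]     # 123z^4 + 76z^2 + 7z + 4
b = [76, 0, 225, 12, 196]  # 196z^4 + 12z^3 + 225z^2 + 76

print(poly_add(a, b))  # [80, 7, 50, 12, 68]
print(poly_sub(a, b))  # [179, 7, 102, 239, 178]
print(poly_mul(a, b))  # [217, 182, 117, 151, 117]
```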

















Subfields of a finite field 


A subset $k$ of a field $K$ is a subfield of $K$ if $k$ is itself a field with respect to the
operations of $K$. In this instance, $K$ is said to be an extension field of $k$. The subfields
of a finite field can be easily characterized. A finite field $\mathbb{F}_{p^m}$ has precisely one subfield
of order $p^l$ for each positive divisor $l$ of $m$; the elements of this subfield are the elements
$a \in \mathbb{F}_{p^m}$ satisfying $a^{p^l} = a$. Conversely, every subfield of $\mathbb{F}_{p^m}$ has order $p^l$ for some
positive divisor $l$ of $m$.


Bases of a finite field 


The finite field $\mathbb{F}_{q^n}$ can be viewed as a vector space over its subfield $\mathbb{F}_q$. Here, vectors
are elements of $\mathbb{F}_{q^n}$, scalars are elements of $\mathbb{F}_q$, vector addition is the addition operation
in $\mathbb{F}_{q^n}$, and scalar multiplication is the multiplication in $\mathbb{F}_{q^n}$ of $\mathbb{F}_q$-elements with
$\mathbb{F}_{q^n}$-elements. The vector space has dimension $n$ and has many bases.
If $B = \{b_1, b_2, \ldots, b_n\}$ is a basis, then $a \in \mathbb{F}_{q^n}$ can be uniquely represented by an
$n$-tuple $(a_1, a_2, \ldots, a_n)$ of $\mathbb{F}_q$-elements where $a = a_1b_1 + a_2b_2 + \cdots + a_nb_n$. For example,
in the polynomial basis representation of the field $\mathbb{F}_{p^m}$ described above, $\mathbb{F}_{p^m}$ is an
$m$-dimensional vector space over $\mathbb{F}_p$ and $\{z^{m-1}, z^{m-2}, \ldots, z^2, z, 1\}$ is a basis for $\mathbb{F}_{p^m}$ over
$\mathbb{F}_p$.


Multiplicative group of a finite field 


The nonzero elements of a finite field $\mathbb{F}_q$, denoted $\mathbb{F}_q^*$, form a cyclic group under
multiplication. Hence there exist elements $b \in \mathbb{F}_q^*$ called generators such that

$$\mathbb{F}_q^* = \{b^i : 0 \leq i \leq q - 2\}.$$

The order of $a \in \mathbb{F}_q^*$ is the smallest positive integer $t$ such that $a^t = 1$. Since $\mathbb{F}_q^*$ is a
cyclic group, it follows that $t$ is a divisor of $q - 1$.
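As a small illustration of orders and generators, the sketch below computes multiplicative orders in the prime field $\mathbb{F}_{29}$ of Example 2.1 by brute force. The function name `order` is an assumption made for this example, and the method is only sensible for very small fields.

```python
def order(a, q):
    """Smallest t >= 1 with a^t = 1 in F_q, for prime q and a not divisible by q."""
    t, x = 1, a % q
    while x != 1:
        x = (x * a) % q
        t += 1
    return t

q = 29
print(order(2, q))    # 28, so 2 is a generator of the multiplicative group
print(order(16, q))   # 7
print(order(28, q))   # 2, since 28 = -1 in F_29

# Every order divides q - 1 = 28.
print(all((q - 1) % order(a, q) == 0 for a in range(1, q)))  # True
```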


2.2 Prime field arithmetic 


This section presents algorithms for performing arithmetic in the prime field $\mathbb{F}_p$.
Algorithms for arbitrary primes $p$ are presented in §2.2.1 to §2.2.5. The reduction step can be
accelerated considerably when the modulus $p$ has a special form. Efficient reduction
algorithms for the NIST primes such as $p = 2^{192} - 2^{64} - 1$ are considered in §2.2.6.

The algorithms presented here are well suited for software implementation. We assume
that the implementation platform has a $W$-bit architecture where $W$ is a multiple
of 8. Workstations are commonly 64- or 32-bit architectures. Low-power or inexpensive
components may have smaller $W$; for example, some embedded systems are 16-bit
and smartcards may have $W = 8$. The bits of a $W$-bit word $U$ are numbered from 0 to
$W - 1$, with the rightmost bit of $U$ designated as bit 0.

The elements of $\mathbb{F}_p$ are the integers from 0 to $p - 1$. Let $m = \lceil \log_2 p \rceil$ be the
bitlength of $p$, and $t = \lceil m/W \rceil$ be its wordlength. Figure 2.1 illustrates the case
where the binary representation of a field element $a$ is stored in an array
$A = (A[t-1], \ldots, A[2], A[1], A[0])$ of $t$ $W$-bit words, where the rightmost bit of $A[0]$ is the least
significant bit.

    | A[t-1] | ... | A[2] | A[1] | A[0] |

Figure 2.1. Representation of $a \in \mathbb{F}_p$ as an array $A$ of $W$-bit words. As an integer,
$a = 2^{(t-1)W}A[t-1] + \cdots + 2^{2W}A[2] + 2^{W}A[1] + A[0]$.
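The word-array representation can be sketched as follows. This is an illustration only: the helper names `to_words` and `from_words` are assumptions, and the Python list index $i$ plays the role of $A[i]$, so the least significant word comes first.

```python
def to_words(a, t, W=32):
    """Split the integer a into t words of W bits; index 0 is least significant."""
    mask = (1 << W) - 1
    return [(a >> (W * i)) & mask for i in range(t)]

def from_words(A, W=32):
    """Recombine: a = sum of 2^(iW) * A[i]."""
    return sum(word << (W * i) for i, word in enumerate(A))

p = 2**192 - 2**64 - 1
W = 32
m = p.bit_length()        # bitlength of p (192)
t = -(-m // W)            # wordlength: ceiling of m / W (6)

A = to_words(p, t, W)
print([hex(w) for w in A])
# ['0xffffffff', '0xffffffff', '0xfffffffe', '0xffffffff', '0xffffffff', '0xffffffff']
```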


Hardware characteristics may favour approaches different from those of the algorithms
and field element representation presented here. §5.1.1 examines possible
bottlenecks in multiplication due to constraints on hardware integer multipliers and
the cost of propagating carries. §5.1.2 briefly discusses the use of floating-point hardware
commonly found on workstations, which can give substantial improvement in
multiplication times (and uses a different field element representation). Similarly,
single-instruction multiple-data (SIMD) registers on some processors can be employed;
see §5.1.3. Selected timings for field operations appear in §5.1.5.


2.2.1 Addition and subtraction 


Algorithms for field addition and subtraction are given in terms of corresponding
algorithms for multi-word integers. The following notation and terminology is used. An
assignment of the form "$(\varepsilon, z) \leftarrow w$" for an integer $w$ is understood to mean

    $z \leftarrow w \bmod 2^W$, and
    $\varepsilon \leftarrow 0$ if $w \in [0, 2^W)$, otherwise $\varepsilon \leftarrow 1$.

If $w = x + y + \varepsilon'$ for $x, y \in [0, 2^W)$ and $\varepsilon' \in \{0, 1\}$, then $w = \varepsilon 2^W + z$ and $\varepsilon$ is called the
carry bit from single-word addition (with $\varepsilon = 1$ if and only if $z < x + \varepsilon'$). Algorithm 2.5
performs addition of multi-word integers.


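Algorithm 2.5 Multiprecision addition
INPUT: Integers $a, b \in [0, 2^{Wt})$.
OUTPUT: $(\varepsilon, c)$ where $c = a + b \bmod 2^{Wt}$ and $\varepsilon$ is the carry bit.
1. $(\varepsilon, C[0]) \leftarrow A[0] + B[0]$.
2. For $i$ from 1 to $t - 1$ do
   2.1 $(\varepsilon, C[i]) \leftarrow A[i] + B[i] + \varepsilon$.
3. Return($\varepsilon, c$).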


On processors that handle the carry as part of the instruction set, there need not 
be any explicit check for carry. Multi-word subtraction (Algorithm 2.6) is similar to 
addition, with the carry bit often called a “borrow” in this context. 


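Algorithm 2.6 Multiprecision subtraction
INPUT: Integers $a, b \in [0, 2^{Wt})$.
OUTPUT: $(\varepsilon, c)$ where $c = a - b \bmod 2^{Wt}$ and $\varepsilon$ is the borrow.
1. $(\varepsilon, C[0]) \leftarrow A[0] - B[0]$.
2. For $i$ from 1 to $t - 1$ do
   2.1 $(\varepsilon, C[i]) \leftarrow A[i] - B[i] - \varepsilon$.
3. Return($\varepsilon, c$).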
Modular addition ($(x + y) \bmod p$) and subtraction ($(x - y) \bmod p$) are adapted
directly from the corresponding algorithms above, with an additional step for reduction
modulo $p$.


Algorithm 2.7 Addition in $\mathbb{F}_p$
INPUT: Modulus $p$, and integers $a, b \in [0, p - 1]$.
OUTPUT: $c = (a + b) \bmod p$.
1. Use Algorithm 2.5 to obtain $(\varepsilon, c)$ where $c = a + b \bmod 2^{Wt}$ and $\varepsilon$ is the carry
   bit.
2. If $\varepsilon = 1$, then subtract $p$ from $c = (C[t-1], \ldots, C[2], C[1], C[0])$;
   Else if $c \geq p$ then $c \leftarrow c - p$.
3. Return($c$).


Algorithm 2.8 Subtraction in $\mathbb{F}_p$
INPUT: Modulus $p$, and integers $a, b \in [0, p - 1]$.
OUTPUT: $c = (a - b) \bmod p$.
1. Use Algorithm 2.6 to obtain $(\varepsilon, c)$ where $c = a - b \bmod 2^{Wt}$ and $\varepsilon$ is the borrow.
2. If $\varepsilon = 1$, then add $p$ to $c = (C[t-1], \ldots, C[2], C[1], C[0])$.
3. Return($c$).
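Algorithms 2.5 to 2.8 can be sketched in Python on little-endian word lists. This is an illustrative model, not a performance-oriented implementation: the word size `W = 32`, the helper names, and the use of Python's unbounded integers for each single-word step are all assumptions of the sketch.

```python
W = 32
MASK = (1 << W) - 1

def to_words(a, t):
    return [(a >> (W * i)) & MASK for i in range(t)]

def from_words(A):
    return sum(x << (W * i) for i, x in enumerate(A))

def mp_add(A, B):
    """Algorithm 2.5: returns (carry, C) with C = A + B mod 2^(Wt)."""
    C, eps = [], 0
    for x, y in zip(A, B):
        w = x + y + eps
        C.append(w & MASK)
        eps = w >> W
    return eps, C

def mp_sub(A, B):
    """Algorithm 2.6: returns (borrow, C) with C = A - B mod 2^(Wt)."""
    C, eps = [], 0
    for x, y in zip(A, B):
        w = x - y - eps
        C.append(w & MASK)      # masking a negative int gives w mod 2^W
        eps = 1 if w < 0 else 0
    return eps, C

def geq(A, B):
    """Integer comparison A >= B, starting from the most significant word."""
    return A[::-1] >= B[::-1]

def fp_add(A, B, P):
    """Algorithm 2.7: (a + b) mod p."""
    eps, C = mp_add(A, B)
    if eps == 1 or geq(C, P):
        _, C = mp_sub(C, P)
    return C

def fp_sub(A, B, P):
    """Algorithm 2.8: (a - b) mod p."""
    eps, C = mp_sub(A, B)
    if eps == 1:
        _, C = mp_add(C, P)
    return C
```

For example, with the two-word modulus $p = 2^{64} - 59$, `from_words(fp_add(to_words(p - 1, 2), to_words(p - 2, 2), to_words(p, 2)))` evaluates to $p - 3$; this input exercises the carry case in step 2 of Algorithm 2.7.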


2.2.2 Integer multiplication 


Field multiplication of $a, b \in \mathbb{F}_p$ can be accomplished by first multiplying $a$ and $b$ as
integers, and then reducing the result modulo $p$. Algorithms 2.9 and 2.10 are elementary
integer multiplication routines which illustrate basic operand scanning and product
scanning methods, respectively. In both algorithms, $(UV)$ denotes a $(2W)$-bit quantity
obtained by concatenation of $W$-bit words $U$ and $V$.


Algorithm 2.9 Integer multiplication (operand scanning form)
INPUT: Integers $a, b \in [0, p - 1]$.
OUTPUT: $c = a \cdot b$.
1. Set $C[i] \leftarrow 0$ for $0 \leq i \leq t - 1$.
2. For $i$ from 0 to $t - 1$ do
   2.1 $U \leftarrow 0$.
   2.2 For $j$ from 0 to $t - 1$ do:
       $(UV) \leftarrow C[i+j] + A[i] \cdot B[j] + U$.
       $C[i+j] \leftarrow V$.
   2.3 $C[i+t] \leftarrow U$.
3. Return($c$).
The calculation $C[i+j] + A[i] \cdot B[j] + U$ at step 2.2 is called the inner product
operation. Since the operands are $W$-bit values, the inner product is bounded by
$2(2^W - 1) + (2^W - 1)^2 = 2^{2W} - 1$ and can be represented by $(UV)$.

Algorithm 2.10 is arranged so that the product $c = ab$ is calculated right-to-left. As in
the preceding algorithm, a $(2W)$-bit product of $W$-bit operands is required. The values
$R_0$, $R_1$, $R_2$, $U$, and $V$ are $W$-bit words.


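Algorithm 2.10 Integer multiplication (product scanning form)
INPUT: Integers $a, b \in [0, p - 1]$.
OUTPUT: $c = a \cdot b$.
1. $R_0 \leftarrow 0$, $R_1 \leftarrow 0$, $R_2 \leftarrow 0$.
2. For $k$ from 0 to $2t - 2$ do
   2.1 For each element of $\{(i, j) \mid i + j = k,\ 0 \leq i, j \leq t - 1\}$ do
       $(UV) \leftarrow A[i] \cdot B[j]$.
       $(\varepsilon, R_0) \leftarrow R_0 + V$.
       $(\varepsilon, R_1) \leftarrow R_1 + U + \varepsilon$.
       $R_2 \leftarrow R_2 + \varepsilon$.
   2.2 $C[k] \leftarrow R_0$, $R_0 \leftarrow R_1$, $R_1 \leftarrow R_2$, $R_2 \leftarrow 0$.
3. $C[2t-1] \leftarrow R_0$.
4. Return($c$).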
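Both scanning orders can be modelled on word lists. The sketch below is one possible rendering of Algorithms 2.9 and 2.10, assuming `W = 32`, little-endian lists, and hypothetical helper names; Python integers stand in for the $(2W)$-bit quantity $(UV)$.

```python
W = 32
MASK = (1 << W) - 1

def to_words(a, t):
    return [(a >> (W * i)) & MASK for i in range(t)]

def from_words(A):
    return sum(x << (W * i) for i, x in enumerate(A))

def mul_operand_scanning(A, B):
    """Algorithm 2.9: row-by-row accumulation into C."""
    t = len(A)
    C = [0] * (2 * t)
    for i in range(t):
        U = 0
        for j in range(t):
            uv = C[i + j] + A[i] * B[j] + U   # inner product, below 2^(2W)
            U, V = uv >> W, uv & MASK
            C[i + j] = V
        C[i + t] = U
    return C

def mul_product_scanning(A, B):
    """Algorithm 2.10: each output word C[k] is completed before moving on."""
    t = len(A)
    C = [0] * (2 * t)
    R0 = R1 = R2 = 0
    for k in range(2 * t - 1):
        for i in range(max(0, k - t + 1), min(k, t - 1) + 1):
            uv = A[i] * B[k - i]
            U, V = uv >> W, uv & MASK
            s = R0 + V
            R0, eps = s & MASK, s >> W
            s = R1 + U + eps
            R1, eps = s & MASK, s >> W
            R2 += eps
        C[k], R0, R1, R2 = R0, R1, R2, 0
    C[2 * t - 1] = R0
    return C
```

Both functions return the $2t$-word product; comparing `from_words(...)` of either result with Python's built-in product of the same operands is a convenient check.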








Note 2.11 (implementing Algorithms 2.9 and 2.10) Algorithms 2.9 and 2.10 are writ- 
ten in a form motivated by the case where a W-bit architecture has a multiplication 
operation giving a 2W-bit result (e.g., the Intel Pentium or Sun SPARC). A common 
exception is illustrated by the 64-bit Sun UltraSPARC, where the multiplier produces 
the lower 64 bits of the product of 64-bit inputs. One variation of these algorithms splits 
a and b into (W/2)-bit half-words, but accumulates in W-bit registers. See also §5.1.3 
for an example concerning a 32-bit architecture which has some 64-bit operations. 


Karatsuba-Ofman multiplication 


Algorithms 2.9 and 2.10 take $O(n^2)$ bit operations for multiplying two $n$-bit integers. A
divide-and-conquer algorithm due to Karatsuba and Ofman reduces the complexity to
$O(n^{\log_2 3})$. Suppose that $n = 2l$ and $x = x_1 2^l + x_0$ and $y = y_1 2^l + y_0$ are $2l$-bit integers.
Then

$$xy = (x_1 2^l + x_0)(y_1 2^l + y_0)
     = x_1 \cdot y_1 2^{2l} + [(x_0 + x_1) \cdot (y_0 + y_1) - x_1 y_1 - x_0 \cdot y_0] 2^l + x_0 y_0$$

and $xy$ can be computed by performing three multiplications of $l$-bit integers (as opposed
to one multiplication with $2l$-bit integers) along with two additions and two
subtractions.^1 For large values of $l$, the cost of the additions and subtractions is
insignificant relative to the cost of the multiplications. The procedure may be applied
recursively to the intermediate values, terminating at some threshold (possibly the word
size of the machine) where a classical or other method is employed.
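The recursion can be sketched directly from the identity above. This is a minimal illustration on Python integers, not a tuned implementation; the function name `karatsuba` and the parameter `threshold_bits` (the size at which the built-in product is used as the "classical" method) are assumptions of the sketch.

```python
def karatsuba(x, y, threshold_bits=32):
    """Multiply non-negative integers x and y with the Karatsuba-Ofman split."""
    n = max(x.bit_length(), y.bit_length())
    if n <= threshold_bits:
        return x * y                       # base case: classical multiplication
    l = (n + 1) // 2                       # split point, about n / 2 bits
    mask = (1 << l) - 1
    x1, x0 = x >> l, x & mask              # x = x1 * 2^l + x0
    y1, y0 = y >> l, y & mask              # y = y1 * 2^l + y0
    hi = karatsuba(x1, y1, threshold_bits)
    lo = karatsuba(x0, y0, threshold_bits)
    # cross term x1*y0 + x0*y1 from one extra multiplication
    mid = karatsuba(x0 + x1, y0 + y1, threshold_bits) - hi - lo
    return (hi << (2 * l)) + (mid << l) + lo

x = 2**224 - 12345
y = 2**223 + 6789
print(karatsuba(x, y) == x * y)  # True
```

Note that the sums `x0 + x1` and `y0 + y1` can be one bit longer than `l`, which is the extra-carry issue that word-oriented implementations have to handle.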

For integers of modest size, the overhead in Karatsuba-Ofman may be significant.
Implementations may deviate from the traditional description in order to reduce the
shifting required (for multiplications by $2^l$ and $2^{2l}$) and make more efficient use of
word-oriented operations. For example, it may be more effective to split on word
boundaries, and the split at a given stage may be into more than two fragments.


Example 2.12 (Karatsuba-Ofman methods) Consider multiplication of 224-bit values
$x$ and $y$, on a machine with word size $W = 32$. Two possible depth-2 approaches are
indicated in Figure 2.2.

[Figure 2.2: tree diagrams. (a) n/2 split: 224 splits into 112 and 112, and each 112 into
56 and 56. (b) split on word boundary: 224 splits into 96 and 128, with 96 split into 32
and 64, and 128 split into 64 and 64.]

Figure 2.2. Depth-2 splits for 224-bit integers. The product $xy$ using (a) has three 112×112
multiplications, each performed using three 56×56 multiplications. Using (b), $xy$ has a 96×96
(split as a 32×32 and two 64×64) and two 128×128 multiplications (each generating three
64×64 multiplies).

The split in Figure 2.2(a) is perhaps mathematically more elegant
and may have more reusable code compared with that in Figure 2.2(b). However, more
shifting will be required (since the splits are not on word boundaries). If multiplication
of 56-bit quantities (perhaps by another application of Karatsuba-Ofman) has approximately
the same cost as multiplication of 64-bit values, then the split has under-utilized
the hardware capabilities since the cost is nine 64-bit multiplications versus one 32-bit
and eight 64-bit multiplications in (b). On the other hand, the split on word boundaries
in Figure 2.2(b) has more complicated cross term calculations, since there may be carry
to an additional word. For example, the cross terms at depth 2 are of the form

$$(x_0 + x_1)(y_0 + y_1) - x_1 y_1 - x_0 y_0$$

where $x_0 + x_1$ and $y_0 + y_1$ are 57-bit in (a) and 65-bit in (b). Split (b) costs somewhat
more here, although $(x_0 + x_1)(y_0 + y_1)$ can be managed as a 64×64 multiply followed
by two possible additions corresponding to the high bits.


^1 The cross term can be written $(x_0 - x_1)(y_1 - y_0) + x_0 y_0 + x_1 y_1$, which may be useful on some platforms
or if it is known a priori that $x_0 \geq x_1$ and $y_0 \leq y_1$.
[Figure 2.3: tree diagrams. (a) binary split: 192 splits into 96 and 96, and each 96 into
32 and 64. (b) 3-way split at depth 1: 192 splits into 64, 64 and 64, and each 64 into 32
and 32. (c) 3-way split at depth 2: 192 splits into 96 and 96, and each 96 into 32, 32 and 32.]

Figure 2.3. Depth-2 splits for 192-bit integers. The product $xy$ using (a) has three 96×96
multiplications. Each is performed with a 32×32 and two 64×64 (each requiring three 32×32)
multiplications, for a total of 21 multiplications of size 32×32. Using (b) or (c), only 18
multiplications of size 32×32 are required.


As a second illustration, consider Karatsuba-Ofman applied to 192-bit integers,
again with $W = 32$. Three possible depth-2 approaches are given in Figure 2.3. In
terms of 32×32 multiplications, the split in Figure 2.3(a) will require 21, while (b) and
(c) use 18. The basic idea is that multiplication of $3l$-bit integers $x = x_2 2^{2l} + x_1 2^l + x_0$
and $y = y_2 2^{2l} + y_1 2^l + y_0$ can be done as

$$xy = (x_2 2^{2l} + x_1 2^l + x_0) \cdot (y_2 2^{2l} + y_1 2^l + y_0)$$
$$   = x_2 y_2 2^{4l} + (x_2 y_1 + x_1 y_2) 2^{3l} + (x_2 y_0 + x_0 y_2 + x_1 y_1) 2^{2l}
       + (x_1 y_0 + x_0 y_1) 2^l + x_0 y_0$$
$$   = x_2 \cdot y_2 2^{4l} + [(x_2 + x_1) \cdot (y_2 + y_1) - x_2 y_2 - x_1 \cdot y_1] 2^{3l}
       + [(x_2 + x_0) \cdot (y_2 + y_0) - x_2 y_2 - x_0 \cdot y_0 + x_1 y_1] 2^{2l}
       + [(x_1 + x_0) \cdot (y_1 + y_0) - x_1 y_1 - x_0 y_0] 2^l + x_0 y_0$$

for a total of six multiplications of $l$-bit integers.
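The six-multiplication identity can be checked numerically. The following sketch assumes Python integers and a hypothetical function name `mul_3way`; it performs exactly the six products named in the last line of the derivation and recombines them with shifts.

```python
def mul_3way(x, y, l):
    """Product of two 3l-bit integers using six multiplications of l-bit pieces."""
    mask = (1 << l) - 1
    x0, x1, x2 = x & mask, (x >> l) & mask, x >> (2 * l)
    y0, y1, y2 = y & mask, (y >> l) & mask, y >> (2 * l)

    p22, p11, p00 = x2 * y2, x1 * y1, x0 * y0    # three "diagonal" products
    p21 = (x2 + x1) * (y2 + y1)                  # three products of sums
    p20 = (x2 + x0) * (y2 + y0)
    p10 = (x1 + x0) * (y1 + y0)

    return ((p22 << (4 * l))
            + ((p21 - p22 - p11) << (3 * l))
            + ((p20 - p22 - p00 + p11) << (2 * l))
            + ((p10 - p11 - p00) << l)
            + p00)

x = 2**192 - 2**64 - 1
y = 2**191 + 12345
print(mul_3way(x, y, 64) == x * y)  # True
```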


The performance of field multiplication is fundamental to mechanisms based on 
elliptic curves. Constraints on hardware integer multipliers and the cost of carry propa- 
gation can result in significant bottlenecks in direct implementations of Algorithms 2.9 
and 2.10. As outlined in the introductory paragraphs of §2.2, Chapter 5 discusses 
alternative strategies applicable in some environments. 


2.2.3 Integer squaring 


Field squaring of $a \in \mathbb{F}_p$ can be accomplished by first squaring $a$ as an integer, and then
reducing the result modulo $p$. A straightforward modification of Algorithm 2.10 gives
the following algorithm for integer squaring, reducing the number of required
single-precision multiplications by roughly half. In step 2.1, a $(2W + 1)$-bit result $(\varepsilon, UV)$ is
obtained from multiplication of the $(2W)$-bit quantity $(UV)$ by 2.


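Algorithm 2.13 Integer squaring
INPUT: Integer $a \in [0, p - 1]$.
OUTPUT: $c = a^2$.
1. $R_0 \leftarrow 0$, $R_1 \leftarrow 0$, $R_2 \leftarrow 0$.
2. For $k$ from 0 to $2t - 2$ do
   2.1 For each element of $\{(i, j) \mid i + j = k,\ 0 \leq i \leq j \leq t - 1\}$ do
       $(UV) \leftarrow A[i] \cdot A[j]$.
       If $(i < j)$ then do: $(\varepsilon, UV) \leftarrow (UV) \cdot 2$, $R_2 \leftarrow R_2 + \varepsilon$.
       $(\varepsilon, R_0) \leftarrow R_0 + V$.
       $(\varepsilon, R_1) \leftarrow R_1 + U + \varepsilon$.
       $R_2 \leftarrow R_2 + \varepsilon$.
   2.2 $C[k] \leftarrow R_0$, $R_0 \leftarrow R_1$, $R_1 \leftarrow R_2$, $R_2 \leftarrow 0$.
3. $C[2t-1] \leftarrow R_0$.
4. Return($c$).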








The multiplication by 2 in step 2.1 may be implemented as two single-precision 
shift-through-carry (if available) or as two single-precision additions with carry. The 
step can be rewritten so that each output word C[k] requires at most one multiplication 
by 2, at the cost of two additional accumulators and an associated accumulation step. 


2.2.4 Reduction 


For moduli p that are not of special form, the reduction z mod p can be an expen- 
sive part of modular multiplication. Since the performance of elliptic curve schemes 
depends heavily on the speed of field multiplication, there is considerable incentive to 
select moduli, such as the NIST-recommended primes of §2.2.6, that permit fast reduc- 
tion. In this section, we present only the reduction method of Barrett and an overview 
of Montgomery multiplication. 

The methods of Barrett and Montgomery are similar in that expensive divisions 
in classical reduction methods are replaced by less-expensive operations. Barrett re- 
duction can be regarded as a direct replacement for classical methods; however, an 
expensive modulus-dependent calculation is required, and hence the method is ap- 
plicable when many reductions are performed with a single modulus. Montgomery’s 
method, on the other hand, requires transformations of the data. The technique can be 
effective when the cost of the input and output conversions is offset by savings in many 
intermediate multiplications, as occurs in modular exponentiation. 

Note that some modular operations are typically required in a larger framework such 
as the signature schemes of §4.4, and the moduli involved need not be of special form. 
In these instances, Barrett reduction may be an appropriate method. 
Barrett reduction


Barrett reduction (Algorithm 2.14) finds $z \bmod p$ for given positive integers $z$ and $p$.
In contrast to the algorithms presented in §2.2.6, Barrett reduction does not exploit any
special form of the modulus $p$. The quotient $\lfloor z/p \rfloor$ is estimated using less-expensive
operations involving powers of a suitably-chosen base $b$ (e.g., $b = 2^L$ for some $L$ which
may depend on the modulus but not on $z$). A modulus-dependent quantity $\lfloor b^{2k}/p \rfloor$
must be calculated, making the algorithm suitable for the case that many reductions are
performed with a single modulus.


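Algorithm 2.14 Barrett reduction
INPUT: $p$, $b \geq 3$, $k = \lfloor \log_b p \rfloor + 1$, $0 \leq z < b^{2k}$, and $\mu = \lfloor b^{2k}/p \rfloor$.
OUTPUT: $z \bmod p$.
1. $\hat{q} \leftarrow \lfloor \lfloor z/b^{k-1} \rfloor \cdot \mu / b^{k+1} \rfloor$.
2. $r \leftarrow (z \bmod b^{k+1}) - (\hat{q} \cdot p \bmod b^{k+1})$.
3. If $r < 0$ then $r \leftarrow r + b^{k+1}$.
4. While $r \geq p$ do: $r \leftarrow r - p$.
5. Return($r$).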
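Algorithm 2.14 translates almost line for line into Python. The sketch below is illustrative: the function names `barrett_setup` and `barrett_reduce` are assumptions, and Python's big-integer `//` and `%` stand in for the digit shifts and truncations that a word-level implementation with $b = 2^L$ would use.

```python
def barrett_setup(p, b):
    """Per-modulus precomputation: k = floor(log_b p) + 1 and mu = floor(b^(2k) / p)."""
    k = 1
    while b ** k <= p:          # smallest k with p < b^k
        k += 1
    mu = b ** (2 * k) // p
    return k, mu

def barrett_reduce(z, p, b, k, mu):
    """Return z mod p for 0 <= z < b^(2k)."""
    assert 0 <= z < b ** (2 * k)
    q_hat = ((z // b ** (k - 1)) * mu) // b ** (k + 1)   # step 1: quotient estimate
    bk1 = b ** (k + 1)
    r = (z % bk1) - ((q_hat * p) % bk1)                  # step 2
    if r < 0:                                            # step 3
        r += bk1
    while r >= p:                                        # step 4: at most two passes
        r -= p
    return r

p = 2**192 - 2**64 - 1
b = 2**32
k, mu = barrett_setup(p, b)
z = (p - 1) ** 2
print(k)                                        # 6
print(barrett_reduce(z, p, b, k, mu) == z % p)  # True
```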


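Note 2.15 (correctness of Algorithm 2.14) Let $q = \lfloor z/p \rfloor$; then $r = z \bmod p = z - qp$.
Step 1 of the algorithm calculates an estimate $\hat{q}$ to $q$ since

$$\frac{z}{p} = \frac{z}{b^{k-1}} \cdot \frac{b^{2k}}{p} \cdot \frac{1}{b^{k+1}}.$$

Note that

$$0 \leq \hat{q} = \left\lfloor \frac{\lfloor z/b^{k-1} \rfloor \cdot \mu}{b^{k+1}} \right\rfloor \leq \left\lfloor \frac{z}{p} \right\rfloor = q.$$

The following argument shows that $q - 2 \leq \hat{q} \leq q$; that is, $\hat{q}$ is a good estimate for $q$.
Define

$$\alpha = \frac{z}{b^{k-1}} - \left\lfloor \frac{z}{b^{k-1}} \right\rfloor, \qquad
  \beta = \frac{b^{2k}}{p} - \left\lfloor \frac{b^{2k}}{p} \right\rfloor.$$

Then $0 \leq \alpha, \beta < 1$ and

$$q = \left\lfloor \frac{(\lfloor z/b^{k-1} \rfloor + \alpha)(\lfloor b^{2k}/p \rfloor + \beta)}{b^{k+1}} \right\rfloor
    \leq \left\lfloor \frac{\lfloor z/b^{k-1} \rfloor \cdot \mu}{b^{k+1}}
         + \frac{\lfloor z/b^{k-1} \rfloor + \lfloor b^{2k}/p \rfloor + 1}{b^{k+1}} \right\rfloor.$$

Since $z < b^{2k}$ and $p \geq b^{k-1}$, it follows that

$$\left\lfloor \frac{z}{b^{k-1}} \right\rfloor + \left\lfloor \frac{b^{2k}}{p} \right\rfloor + 1
  \leq (b^{k+1} - 1) + b^{k+1} + 1 = 2b^{k+1}$$

and hence $q \leq \hat{q} + 2$.

The value $r$ calculated in step 2 necessarily satisfies $r \equiv z - \hat{q}p \pmod{b^{k+1}}$ with
$|r| < b^{k+1}$. Hence $0 \leq r < b^{k+1}$ and $r = z - \hat{q}p \bmod b^{k+1}$ after step 3. Now, since
$0 \leq z - qp < p$, we have

$$0 \leq z - \hat{q}p \leq z - (q - 2)p < 3p.$$

Since $b \geq 3$ and $p < b^k$, we have $3p < b^{k+1}$. Thus $0 \leq z - \hat{q}p < b^{k+1}$, and so $r = z - \hat{q}p$
after step 3. Hence, at most two subtractions at step 4 are required to obtain $0 \leq r < p$,
and then $r = z \bmod p$.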


Note 2.16 (computational considerations for Algorithm 2.14)

(i) A natural choice for the base is $b = 2^L$ where $L$ is near the word size of the
processor.

(ii) Other than the calculation of $\mu$ (which is done once per modulus), the divisions
required are simple shifts of the base-$b$ representation.

(iii) Let $z' = \lfloor z/b^{k-1} \rfloor$. Note that $z'$ and $\mu$ have at most $k + 1$ base-$b$ digits. The
calculation of $\hat{q}$ in step 1 discards the $k + 1$ least-significant digits of the product
$z'\mu$. Given the base-$b$ representations $z' = \sum z'_i b^i$ and $\mu = \sum \mu_j b^j$, write

$$z'\mu = \sum_{l=0}^{2k} \Bigl( \underbrace{\sum_{i+j=l} z'_i \mu_j}_{w_l} \Bigr) b^l$$

where $w_l$ may exceed $b - 1$. If $b \geq k - 1$, then $\sum_{l=0}^{k-2} w_l b^l < b^{k+1}$ and hence

$$0 \leq \frac{z'\mu}{b^{k+1}} - \sum_{l=k-1}^{2k} \frac{w_l b^l}{b^{k+1}} = \sum_{l=0}^{k-2} \frac{w_l b^l}{b^{k+1}} < 1.$$

It follows that $\lfloor \sum_{l=k-1}^{2k} w_l b^l / b^{k+1} \rfloor$ underestimates $\hat{q}$ by at most 1 if $b \geq k - 1$.
At most $\binom{k+2}{2} + k = (k^2 + 5k + 2)/2$ single-precision multiplications (i.e.,
multiplications of values less than $b$) are required to find this estimate for $\hat{q}$.

(iv) Only the $k + 1$ least significant digits of $\hat{q} \cdot p$ are required at step 2. Since $p < b^k$,
the $k + 1$ digits can be obtained with $\binom{k+1}{2} + k$ single-precision multiplications.
Montgomery multiplication


As with Barrett reduction, the strategy in Montgomery's method is to replace division
in classical reduction algorithms with less-expensive operations. The method is not
efficient for a single modular multiplication, but can be used effectively in computations
such as modular exponentiation where many multiplications are performed for given
input. For this section, we give only an overview (for more details, see §2.5).

Let $R > p$ with $\gcd(R, p) = 1$. Montgomery reduction produces $zR^{-1} \bmod p$ for an
input $z < pR$. We consider the case that $p$ is odd, so that $R = 2^{Wt}$ may be selected and
division by $R$ is relatively inexpensive. If $p' = -p^{-1} \bmod R$, then $c = zR^{-1} \bmod p$
may be obtained via

    $c \leftarrow (z + (zp' \bmod R)p)/R$,
    if $c \geq p$ then $c \leftarrow c - p$,

with $t(t + 1)$ single-precision multiplications (and no divisions).

Given $x \in [0, p)$, let $\tilde{x} = xR \bmod p$. Note that $(\tilde{x}\tilde{y})R^{-1} \bmod p = (xy)R \bmod p$; that
is, Montgomery reduction can be used in a multiplication method on representatives $\tilde{x}$.
We define the Montgomery product of $\tilde{x}$ and $\tilde{y}$ to be

$$\mathrm{Mont}(\tilde{x}, \tilde{y}) = \tilde{x}\tilde{y}R^{-1} \bmod p = xyR \bmod p. \qquad (2.1)$$

A single modular multiplication cannot afford the expensive transformations $x \mapsto \tilde{x} =
xR \bmod p$ and $\tilde{x} \mapsto \tilde{x}R^{-1} \bmod p = x$; however, the transformations are performed
only once when used as part of a larger calculation such as modular exponentiation, as
illustrated in Algorithm 2.17.


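Algorithm 2.17 Montgomery exponentiation (basic)
INPUT: Odd modulus $p$, $R = 2^{Wt}$, $p' = -p^{-1} \bmod R$, $x \in [0, p)$, $e = (e_l, \ldots, e_0)_2$.
OUTPUT: $x^e \bmod p$.
1. $\tilde{x} \leftarrow xR \bmod p$, $A \leftarrow R \bmod p$.
2. For $i$ from $l$ downto 0 do
   2.1 $A \leftarrow \mathrm{Mont}(A, A)$.
   2.2 If $e_i = 1$ then $A \leftarrow \mathrm{Mont}(A, \tilde{x})$.
3. Return($\mathrm{Mont}(A, 1)$).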
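Montgomery reduction and Algorithm 2.17 can be modelled on Python integers as follows. This is a sketch only: the function names are assumptions, `W = 32` is an arbitrary word size, the precomputation of $p'$ uses Python 3.8's modular inverse via `pow`, and the word-level structure that yields the $t(t+1)$ multiplication count is not reproduced.

```python
def mont_setup(p, W=32):
    """Return R = 2^(Wt) and p' = -p^(-1) mod R for an odd modulus p."""
    t = -(-p.bit_length() // W)          # wordlength of p
    R = 1 << (W * t)
    p_prime = (-pow(p, -1, R)) % R       # Python 3.8+
    return R, p_prime

def mont_reduce(z, p, R, p_prime):
    """Return z * R^(-1) mod p for 0 <= z < p * R."""
    c = (z + ((z * p_prime) % R) * p) // R   # numerator is divisible by R
    return c - p if c >= p else c

def mont_mul(xt, yt, p, R, p_prime):
    """Montgomery product of two representatives, as in (2.1)."""
    return mont_reduce(xt * yt, p, R, p_prime)

def mont_exp(x, e, p, W=32):
    """Algorithm 2.17: x^e mod p by left-to-right square-and-multiply."""
    R, p_prime = mont_setup(p, W)
    xt = (x * R) % p                     # x in Montgomery form
    A = R % p                            # 1 in Montgomery form
    for bit in bin(e)[2:]:               # bits e_l, ..., e_0
        A = mont_mul(A, A, p, R, p_prime)
        if bit == '1':
            A = mont_mul(A, xt, p, R, p_prime)
    return mont_reduce(A, p, R, p_prime) # Mont(A, 1) leaves Montgomery form

p = 2**61 - 1
print(mont_exp(5, 117, p) == pow(5, 117, p))  # True
```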


As a rough comparison, Montgomery reduction requires $t(t + 1)$ single-precision
multiplications, while Barrett (with $b = 2^W$) uses $t(t + 4) + 1$, and hence Montgomery
methods are expected to be superior in calculations such as general modular
exponentiation. Both methods are expected to be much slower than the direct reduction
techniques of §2.2.6 for moduli of special form.

Montgomery arithmetic can be used to accelerate modular inversion methods that
use repeated multiplication, where $a^{-1}$ is obtained as $a^{p-2} \bmod p$ (since $a^{p-1} \equiv 1
\pmod{p}$ if $\gcd(a, p) = 1$). Elliptic curve point multiplication (§3.3) can benefit from
Montgomery arithmetic, where the Montgomery inverse discussed in §2.2.5 may also
be of interest.


2.2.5 Inversion 


Recall that the inverse of a nonzero element $a \in \mathbb{F}_p$, denoted $a^{-1} \bmod p$ or simply $a^{-1}$
if the field is understood from context, is the unique element $x \in \mathbb{F}_p$ such that $ax = 1$
in $\mathbb{F}_p$, i.e., $ax \equiv 1 \pmod{p}$. Inverses can be efficiently computed by the extended
Euclidean algorithm for integers.


The extended Euclidean algorithm for integers 


Let a and b be integers, not both 0. The greatest common divisor (gcd) of a and b, 
denoted gcd(a, b), is the largest integer d that divides both a and b. Efficient algorithms 
for computing gcd(a, b) exploit the following simple result. 


Theorem 2.18 Let a and b be positive integers. Then gcd(a, b) = gcd(b — ca, a) for 
all integers c. 


In the classical Euclidean algorithm for computing the gcd of positive integers $a$ and
$b$ where $b \geq a$, $b$ is divided by $a$ to obtain a quotient $q$ and a remainder $r$ satisfying
$b = qa + r$ and $0 \leq r < a$. By Theorem 2.18, $\gcd(a, b) = \gcd(r, a)$. Thus, the problem
of determining $\gcd(a, b)$ is reduced to that of computing $\gcd(r, a)$ where the arguments
$(r, a)$ are smaller than the original arguments $(a, b)$. This process is repeated until one
of the arguments is 0, and the result is then immediately obtained since $\gcd(0, d) = d$.
The algorithm must terminate since the non-negative remainders are strictly decreasing.
Moreover, it is efficient because the number of division steps can be shown to be at most
$2k$ where $k$ is the bitlength of $a$.

The Euclidean algorithm can be extended to find integers $x$ and $y$ such that $ax + by =
d$ where $d = \gcd(a, b)$. Algorithm 2.19 maintains the invariants

$$ax_1 + by_1 = u, \qquad ax_2 + by_2 = v, \qquad u \leq v.$$

The algorithm terminates when $u = 0$, in which case $v = \gcd(a, b)$ and $x = x_2$, $y = y_2$
satisfy $ax + by = d$.
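Algorithm 2.19 Extended Euclidean algorithm for integers
INPUT: Positive integers $a$ and $b$ with $a \leq b$.
OUTPUT: $d = \gcd(a, b)$ and integers $x, y$ satisfying $ax + by = d$.
1. $u \leftarrow a$, $v \leftarrow b$.
2. $x_1 \leftarrow 1$, $y_1 \leftarrow 0$, $x_2 \leftarrow 0$, $y_2 \leftarrow 1$.
3. While $u \neq 0$ do
   3.1 $q \leftarrow \lfloor v/u \rfloor$, $r \leftarrow v - qu$, $x \leftarrow x_2 - qx_1$, $y \leftarrow y_2 - qy_1$.
   3.2 $v \leftarrow u$, $u \leftarrow r$, $x_2 \leftarrow x_1$, $x_1 \leftarrow x$, $y_2 \leftarrow y_1$, $y_1 \leftarrow y$.
4. $d \leftarrow v$, $x \leftarrow x_2$, $y \leftarrow y_2$.
5. Return($d, x, y$).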














Suppose now that $p$ is prime and $a \in [1, p - 1]$, and hence $\gcd(a, p) = 1$. If
Algorithm 2.19 is executed with inputs $(a, p)$, the last nonzero remainder $r$ encountered
in step 3.1 is $r = 1$. Subsequent to this occurrence, the integers $u$, $x_1$ and $y_1$ as updated
in step 3.2 satisfy $ax_1 + py_1 = u$ with $u = 1$. Hence $ax_1 \equiv 1 \pmod{p}$ and so
$a^{-1} = x_1 \bmod p$. Note that $y_1$ and $y_2$ are not needed for the determination of $x_1$. These
observations lead to Algorithm 2.20 for inversion in $\mathbb{F}_p$.


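Algorithm 2.20 Inversion in $\mathbb{F}_p$ using the extended Euclidean algorithm
INPUT: Prime $p$ and $a \in [1, p - 1]$.
OUTPUT: $a^{-1} \bmod p$.
1. $u \leftarrow a$, $v \leftarrow p$.
2. $x_1 \leftarrow 1$, $x_2 \leftarrow 0$.
3. While $u \neq 1$ do
   3.1 $q \leftarrow \lfloor v/u \rfloor$, $r \leftarrow v - qu$, $x \leftarrow x_2 - qx_1$.
   3.2 $v \leftarrow u$, $u \leftarrow r$, $x_2 \leftarrow x_1$, $x_1 \leftarrow x$.
4. Return($x_1 \bmod p$).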
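Algorithms 2.19 and 2.20 can be written out in a few lines of Python. This is a direct, illustrative transcription using the same variable names as the pseudocode; the function names `ext_gcd` and `inv_mod` are assumptions.

```python
def ext_gcd(a, b):
    """Algorithm 2.19: return (d, x, y) with d = gcd(a, b) and a*x + b*y = d."""
    u, v = a, b
    x1, y1, x2, y2 = 1, 0, 0, 1
    while u != 0:
        q = v // u
        r, x, y = v - q * u, x2 - q * x1, y2 - q * y1
        v, u, x2, x1, y2, y1 = u, r, x1, x, y1, y
    return v, x2, y2

def inv_mod(a, p):
    """Algorithm 2.20: return a^(-1) mod p for prime p and a in [1, p-1]."""
    u, v, x1, x2 = a, p, 1, 0
    while u != 1:
        q = v // u
        r, x = v - q * u, x2 - q * x1
        v, u, x2, x1 = u, r, x1, x
    return x1 % p

d, x, y = ext_gcd(17, 29)
print(d, 17 * x + 29 * y)   # 1 1
print(inv_mod(17, 29))      # 12, as in Example 2.1(iv)
```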








Binary inversion algorithm 


A drawback of Algorithm 2.20 is the requirement for computationally expensive divi- 
sion operations in step 3.1. The binary inversion algorithm replaces the divisions with 
cheaper shifts (divisions by 2) and subtractions. The algorithm is an extended version 
of the binary gcd algorithm which is presented next. 

Before each iteration of step 3.1 of Algorithm 2.21, at most one of $u$ and $v$ is even.
Thus the divisions by 2 in steps 3.1 and 3.2 do not change the value of $\gcd(u, v)$. In
each iteration, after steps 3.1 and 3.2, both $u$ and $v$ are odd and hence exactly one of
$u$ and $v$ will be even at the end of step 3.3. Thus, each iteration of step 3 reduces the
bitlength of either $u$ or $v$ by at least one. It follows that the total number of iterations
of step 3 is at most $2k$ where $k$ is the maximum of the bitlengths of $a$ and $b$.
Algorithm 2.21 Binary gcd algorithm
INPUT: Positive integers $a$ and $b$.
OUTPUT: $\gcd(a, b)$.
1. $u \leftarrow a$, $v \leftarrow b$, $e \leftarrow 1$.
2. While both $u$ and $v$ are even do: $u \leftarrow u/2$, $v \leftarrow v/2$, $e \leftarrow 2e$.
3. While $u \neq 0$ do
   3.1 While $u$ is even do: $u \leftarrow u/2$.
   3.2 While $v$ is even do: $v \leftarrow v/2$.
   3.3 If $u \geq v$ then $u \leftarrow u - v$; else $v \leftarrow v - u$.
4. Return($e \cdot v$).


Algorithm 2.22 computes $a^{-1} \bmod p$ by finding an integer $x$ such that $ax + py = 1$.
The algorithm maintains the invariants

$$ax_1 + py_1 = u, \qquad ax_2 + py_2 = v$$

where $y_1$ and $y_2$ are not explicitly computed. The algorithm terminates when $u = 1$ or
$v = 1$. In the former case, $ax_1 + py_1 = 1$ and hence $a^{-1} = x_1 \bmod p$. In the latter case,
$ax_2 + py_2 = 1$ and $a^{-1} = x_2 \bmod p$.


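Algorithm 2.22 Binary algorithm for inversion in $\mathbb{F}_p$
INPUT: Prime $p$ and $a \in [1, p - 1]$.
OUTPUT: $a^{-1} \bmod p$.
1. $u \leftarrow a$, $v \leftarrow p$.
2. $x_1 \leftarrow 1$, $x_2 \leftarrow 0$.
3. While ($u \neq 1$ and $v \neq 1$) do
   3.1 While $u$ is even do
       $u \leftarrow u/2$.
       If $x_1$ is even then $x_1 \leftarrow x_1/2$; else $x_1 \leftarrow (x_1 + p)/2$.
   3.2 While $v$ is even do
       $v \leftarrow v/2$.
       If $x_2$ is even then $x_2 \leftarrow x_2/2$; else $x_2 \leftarrow (x_2 + p)/2$.
   3.3 If $u \geq v$ then: $u \leftarrow u - v$, $x_1 \leftarrow x_1 - x_2$;
       Else: $v \leftarrow v - u$, $x_2 \leftarrow x_2 - x_1$.
4. If $u = 1$ then return($x_1 \bmod p$); else return($x_2 \bmod p$).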











A division algorithm producing $b/a = ba^{-1} \bmod p$ can be obtained directly from the
binary algorithm by changing the initialization condition $x_1 \leftarrow 1$ to $x_1 \leftarrow b$. The running
times are expected to be the same, since $x_1$ in the inversion algorithm is expected to be
full-length after a few iterations. Division algorithms are discussed in more detail for
binary fields (§2.3) where the lower cost of inversion relative to multiplication makes
division especially attractive.
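The binary gcd, the binary inversion algorithm, and the division variant described above can be sketched as follows. The function names and the optional argument `b` (which selects the initialization $x_1 \leftarrow b$) are assumptions of this illustration; the comparisons in step 3.3 use $u \geq v$.

```python
def binary_gcd(a, b):
    """Algorithm 2.21 for positive integers a and b."""
    u, v, e = a, b, 1
    while u % 2 == 0 and v % 2 == 0:
        u, v, e = u // 2, v // 2, 2 * e
    while u != 0:
        while u % 2 == 0:
            u //= 2
        while v % 2 == 0:
            v //= 2
        if u >= v:
            u -= v
        else:
            v -= u
    return e * v

def binary_inv(a, p, b=1):
    """Algorithm 2.22 for an odd prime p; returns b * a^(-1) mod p (b = 1 gives the inverse)."""
    u, v, x1, x2 = a, p, b, 0
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2   # halve x1 modulo p
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2   # halve x2 modulo p
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p

print(binary_gcd(48, 18))      # 6
print(binary_inv(17, 29))      # 12
print(binary_inv(17, 29, 5))   # 2, i.e. 5/17 in F_29
```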
Algorithm 2.22 can be converted to a two-stage inversion method that first finds
$a^{-1}2^k \bmod p$ for some integer $k \geq 0$ and then solves for $a^{-1}$. This alternative is
similar to the almost inverse method (Algorithm 2.50) for inversion in binary fields, and
permits some optimizations not available in a direct implementation of Algorithm 2.22.
The basic method is outlined in the context of the Montgomery inverse below, where
the strategy is particularly appropriate.


Montgomery inversion 


As outlined in §2.2.4, the basic strategy in Montgomery's method is to replace modular
reduction $z \bmod p$ by a less-expensive operation $zR^{-1} \bmod p$ for a suitably chosen $R$.
Montgomery arithmetic can be regarded as operating on representatives $\tilde{x} = xR \bmod p$,
and is applicable in calculations such as modular exponentiation where the required
initial and final conversions $x \mapsto \tilde{x}$ and $\tilde{x} \mapsto \tilde{x}R^{-1} \bmod p = x$ are an insignificant
portion of the overall computation.

Let $p > 2$ be an odd (but possibly composite) integer, and define $n = \lceil \log_2 p \rceil$.
The Montgomery inverse of an integer $a$ with $\gcd(a, p) = 1$ is $a^{-1}2^n \bmod p$.
Algorithm 2.23 is a modification of the binary algorithm (Algorithm 2.22), and computes
$a^{-1}2^k \bmod p$ for some integer $k \in [n, 2n]$.


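Algorithm 2.23 Partial Montgomery inversion in $\mathbb{F}_p$
INPUT: Odd integer $p > 2$, $a \in [1, p - 1]$, and $n = \lceil \log_2 p \rceil$.
OUTPUT: Either "not invertible" or $(x, k)$ where $n \leq k \leq 2n$ and $x = a^{-1}2^k \bmod p$.
1. $u \leftarrow a$, $v \leftarrow p$, $x_1 \leftarrow 1$, $x_2 \leftarrow 0$, $k \leftarrow 0$.
2. While $v > 0$ do
   2.1 If $v$ is even then $v \leftarrow v/2$, $x_1 \leftarrow 2x_1$;
       else if $u$ is even then $u \leftarrow u/2$, $x_2 \leftarrow 2x_2$;
       else if $v \geq u$ then $v \leftarrow (v - u)/2$, $x_2 \leftarrow x_2 + x_1$, $x_1 \leftarrow 2x_1$;
       else $u \leftarrow (u - v)/2$, $x_1 \leftarrow x_2 + x_1$, $x_2 \leftarrow 2x_2$.
   2.2 $k \leftarrow k + 1$.
3. If $u \neq 1$ then return("not invertible").
4. If $x_1 > p$ then $x_1 \leftarrow x_1 - p$.
5. Return($x_1, k$).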





For invertible a, the Montgomery inverse $a^{-1}2^n \bmod p$ may be obtained from the
output (x, k) by k − n repeated divisions of the form:

    if x is even then x ← x/2; else x ← (x + p)/2.    (2.2)

Compared with the binary method (Algorithm 2.22) for producing the ordinary inverse,
Algorithm 2.23 has simpler updating of the variables $x_1$ and $x_2$, although k − n of the
more expensive updates occur in (2.2).

Note 2.24 (correctness of and implementation considerations for Algorithm 2.23)
(i) In addition to gcd(u, v) = gcd(a, p), the invariants

    $ax_1 \equiv u2^k \pmod p$  and  $ax_2 \equiv -v2^k \pmod p$

are maintained. If gcd(a, p) = 1, then u = 1 and $x_1 \equiv a^{-1}2^k \pmod p$ at the last
iteration of step 2.

(ii) Until the last iteration, the conditions

    $p = vx_1 + ux_2$,  $x_1 \ge 1$,  $v \ge 1$,  $0 \le u \le a$,

hold, and hence $x_1, v \in [1, p]$. At the last iteration, $x_1 \leftarrow 2x_1 \le 2p$; if gcd(a, p) = 1,
then necessarily $x_1 < 2p$ and step 4 ensures $x_1 < p$. Unlike Algorithm 2.22,
the variables $x_1$ and $x_2$ grow slowly, possibly allowing some implementation
optimizations.

(iii) Each iteration of step 2 reduces the product uv by at least half and the sum u + v
by at most half. Initially u + v = a + p and uv = ap, and u = v = 1 before the final
iteration. Hence $(a + p)/2 \le 2^{k-1} \le ap$, and it follows that $2^{n-2} < 2^{k-1} < 2^{2n}$
and n ≤ k ≤ 2n.
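
As a minimal sketch (not an optimized implementation), Algorithm 2.23 and the second stage (2.2) can be written in Python using built-in arbitrary-precision integers. The function names `partial_montgomery_inverse` and `montgomery_inverse` are illustrative choices, not names from the text; `None` stands in for the "not invertible" result.

```python
def partial_montgomery_inverse(a, p):
    """Algorithm 2.23: return (x, k) with x = a^(-1) * 2^k mod p, or None."""
    u, v, x1, x2, k = a, p, 1, 0, 0
    while v > 0:
        if v % 2 == 0:
            v, x1 = v // 2, 2 * x1
        elif u % 2 == 0:
            u, x2 = u // 2, 2 * x2
        elif v >= u:
            v, x2, x1 = (v - u) // 2, x2 + x1, 2 * x1
        else:
            u, x1, x2 = (u - v) // 2, x2 + x1, 2 * x2
        k += 1
    if u != 1:
        return None  # gcd(a, p) != 1: not invertible
    if x1 > p:
        x1 -= p
    return x1, k


def montgomery_inverse(a, p):
    """Second stage (2.2): halve k - n times to obtain a^(-1) * 2^n mod p."""
    n = p.bit_length()  # equals ceil(log2 p) for odd p > 2
    x, k = partial_montgomery_inverse(a, p)  # assumes gcd(a, p) = 1
    for _ in range(k - n):
        x = x // 2 if x % 2 == 0 else (x + p) // 2
    return x
```

For example, with p = 7 and a = 3 the first stage returns (3, 4), and one halving step of the form (2.2) gives the Montgomery inverse 5, which equals $3^{-1} \cdot 2^3 \bmod 7$.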


Montgomery arithmetic commonly selects $R = 2^{Wt} \ge 2^n$ for efficiency and uses
representatives $\tilde{x} = xR \bmod p$. The Montgomery product Mont($\tilde{x}$, $\tilde{y}$) of $\tilde{x}$ and $\tilde{y}$ is as
defined in (2.1). The second stage (2.2) can be modified to use Montgomery
multiplication to produce $a^{-1} \bmod p$ or $a^{-1}R \bmod p$ (rather than $a^{-1}2^n \bmod p$) from a, or
to calculate $a^{-1}R \bmod p$ when Algorithm 2.23 is presented with $\tilde{a}$ rather than a.
Algorithm 2.25 is applicable in elliptic curve point multiplication (§3.3) if Montgomery
arithmetic is used with affine coordinates.


Algorithm 2.25 Montgomery inversion in $\mathbb{F}_p$
INPUT: Odd integer p > 2, $n = \lceil \log_2 p \rceil$, $R^2 \bmod p$, and $\tilde{a} = aR \bmod p$ with
       gcd(a, p) = 1.
OUTPUT: $a^{-1}R \bmod p$.
1. Use Algorithm 2.23 to find (x, k) where $x = \tilde{a}^{-1}2^k \bmod p$ and n ≤ k ≤ 2n.
2. If k < Wt then
   2.1 x ← Mont(x, $R^2$) = $a^{-1}2^k \bmod p$.
   2.2 k ← k + Wt. {Now, k > Wt.}
3. x ← Mont(x, $R^2$) = $a^{-1}2^k \bmod p$.
4. x ← Mont(x, $2^{2Wt-k}$) = $a^{-1}R \bmod p$.
5. Return(x).
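
The steps of Algorithm 2.25 can be traced with the following Python sketch. It is an illustration under stated assumptions: `mont` is a reference Montgomery product computed with an explicit $R^{-1}$ (a real implementation would use the word-level reduction of §2.2.4), the word size `W = 32` is an arbitrary default, and all function names are hypothetical.

```python
def partial_montgomery_inverse(a, p):
    """Algorithm 2.23 for invertible a: (x, k) with x = a^(-1) * 2^k mod p."""
    u, v, x1, x2, k = a, p, 1, 0, 0
    while v > 0:
        if v % 2 == 0:
            v, x1 = v // 2, 2 * x1
        elif u % 2 == 0:
            u, x2 = u // 2, 2 * x2
        elif v >= u:
            v, x2, x1 = (v - u) // 2, x2 + x1, 2 * x1
        else:
            u, x1, x2 = (u - v) // 2, x2 + x1, 2 * x2
        k += 1
    if u != 1:
        raise ValueError("not invertible")
    return (x1 - p if x1 > p else x1), k


def mont(x, y, p, R):
    """Reference Montgomery product x*y*R^(-1) mod p (not word-optimized)."""
    return x * y * pow(R, -1, p) % p


def montgomery_inversion(a_tilde, p, W=32):
    """Algorithm 2.25: map a_tilde = a*R mod p to a^(-1)*R mod p."""
    t = -(-p.bit_length() // W)   # number of W-bit words needed for p
    Wt = W * t
    R = 1 << Wt
    R2 = R * R % p
    x, k = partial_montgomery_inverse(a_tilde, p)   # step 1
    if k < Wt:                                      # step 2
        x = mont(x, R2, p, R)
        k += Wt
    x = mont(x, R2, p, R)                           # step 3
    return mont(x, 1 << (2 * Wt - k), p, R)         # step 4
```

Both branches of step 2 are exercised in practice: a small modulus such as p = 101 gives k < Wt, while a modulus that nearly fills its words typically gives k ≥ Wt.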


The value $a^{-1}R \equiv R^2/(aR) \pmod p$ may also be obtained by a division algorithm
variant of Algorithm 2.22 with inputs $R^2 \bmod p$ and $\tilde{a}$. However, Algorithm 2.25 may
have implementation advantages, and the Montgomery multiplications required are
expected to be relatively inexpensive compared to the cost of inversion.


Simultaneous inversion


Field inversion tends to be expensive relative to multiplication. If inverses are required
for several elements, then the method of simultaneous inversion finds the inverses with
a single inversion and approximately three multiplications per element. The method is
based on the observation that 1/x = y(1/xy) and 1/y = x(1/xy), which is generalized
in Algorithm 2.26 to k elements.


Algorithm 2.26 Simultaneous inversion
INPUT: Prime p and nonzero elements $a_1, \ldots, a_k$ in $\mathbb{F}_p$.
OUTPUT: Field elements $a_1^{-1}, \ldots, a_k^{-1}$, where $a_i a_i^{-1} \equiv 1 \pmod p$.
1. c_1 ← a_1.
2. For i from 2 to k do: c_i ← c_{i−1} a_i mod p.
3. u ← c_k^{-1} mod p.
4. For i from k downto 2 do
   4.1 a_i^{-1} ← u c_{i−1} mod p.
   4.2 u ← u a_i mod p.
5. a_1^{-1} ← u.
6. Return($a_1^{-1}, \ldots, a_k^{-1}$).




For k elements, the algorithm requires one inversion and 3(k − 1) multiplications,
along with k elements of temporary storage. Although the algorithm is presented in 
the context of prime fields, the technique can be adapted to other fields and is superior 
to k separate inversions whenever the cost of an inversion is higher than that of three 
multiplications. 
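
A short Python sketch of Algorithm 2.26 follows. It assumes a nonempty list of nonzero residues and uses Python's built-in `pow(x, -1, p)` (Python 3.8 or later) for the single inversion; the function name `simultaneous_inverse` is illustrative only.

```python
def simultaneous_inverse(elems, p):
    """Algorithm 2.26: invert every element of elems modulo the prime p."""
    k = len(elems)
    # Steps 1-2: running products c[i] = a_1 * ... * a_(i+1) mod p.
    c = [elems[0] % p]
    for i in range(1, k):
        c.append(c[-1] * elems[i] % p)
    # Step 3: the only inversion performed.
    u = pow(c[-1], -1, p)
    # Steps 4-5: peel off one inverse per iteration, last element first.
    inv = [0] * k
    for i in range(k - 1, 0, -1):
        inv[i] = u * c[i - 1] % p
        u = u * elems[i] % p
    inv[0] = u
    return inv
```

The loop in steps 2 and 4 performs exactly 3(k − 1) modular multiplications, matching the operation count stated above.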


2.2.6 NIST primes 


The FIPS 186-2 standard recommends elliptic curves over the five prime fields with 
moduli: 


    $p_{192} = 2^{192} - 2^{64} - 1$
    $p_{224} = 2^{224} - 2^{96} + 1$
    $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$
    $p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$
    $p_{521} = 2^{521} - 1$.


These primes have the property that they can be written as the sum or difference of a
small number of powers of 2. Furthermore, except for $p_{521}$, the powers appearing in
these expressions are all multiples of 32. These properties yield reduction algorithms
that are especially fast on machines with wordsize 32.

For example, consider $p = p_{192} = 2^{192} - 2^{64} - 1$, and let c be an integer with
$0 \le c < p^2$. Let

    $c = c_5 2^{320} + c_4 2^{256} + c_3 2^{192} + c_2 2^{128} + c_1 2^{64} + c_0$    (2.3)

be the base-$2^{64}$ representation of c, where each $c_i \in [0, 2^{64} - 1]$. We can then reduce
the higher powers of 2 in (2.3) using the congruences

    $2^{192} \equiv 2^{64} + 1 \pmod p$
    $2^{256} \equiv 2^{128} + 2^{64} \pmod p$
    $2^{320} \equiv 2^{128} + 2^{64} + 1 \pmod p$.

We thus obtain

    $c \equiv c_5 2^{128} + c_5 2^{64} + c_5$
        $+ c_4 2^{128} + c_4 2^{64}$
        $+ c_3 2^{64} + c_3$
        $+ c_2 2^{128} + c_1 2^{64} + c_0 \pmod p$.

Hence, c modulo p can be obtained by adding the four 192-bit integers
$c_5 2^{128} + c_5 2^{64} + c_5$, $c_4 2^{128} + c_4 2^{64}$, $c_3 2^{64} + c_3$ and $c_2 2^{128} + c_1 2^{64} + c_0$, and
repeatedly subtracting p until the result is less than p.


Algorithm 2.27 Fast reduction modulo $p_{192} = 2^{192} - 2^{64} - 1$
INPUT: An integer c = (c_5, c_4, c_3, c_2, c_1, c_0) in base $2^{64}$ with $0 \le c < p_{192}^2$.
OUTPUT: c mod $p_{192}$.
1. Define 192-bit integers:
   s_1 = (c_2, c_1, c_0), s_2 = (0, c_3, c_3),
   s_3 = (c_4, c_4, 0), s_4 = (c_5, c_5, c_5).
2. Return(s_1 + s_2 + s_3 + s_4 mod $p_{192}$).
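
The following Python sketch mirrors Algorithm 2.27. It works on Python integers rather than fixed-width words, so it illustrates the word rearrangement rather than a tuned implementation; the names `P192` and `reduce_p192` are illustrative.

```python
P192 = 2**192 - 2**64 - 1


def reduce_p192(c):
    """Algorithm 2.27: reduce 0 <= c < P192^2 modulo P192."""
    mask = (1 << 64) - 1
    # Split c into six 64-bit digits c0 (least significant) ... c5.
    c0, c1, c2, c3, c4, c5 = [(c >> (64 * i)) & mask for i in range(6)]
    s1 = (c2 << 128) | (c1 << 64) | c0   # (c2, c1, c0)
    s2 = (c3 << 64) | c3                 # (0,  c3, c3)
    s3 = (c4 << 128) | (c4 << 64)        # (c4, c4, 0)
    s4 = (c5 << 128) | (c5 << 64) | c5   # (c5, c5, c5)
    r = s1 + s2 + s3 + s4
    # The sum is below 4 * 2^192, so only a few subtractions are needed.
    while r >= P192:
        r -= P192
    return r
```

No division or general multiplication is used: the reduction consists of three additions of 192-bit values followed by a small number of conditional subtractions.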


Algorithm 2.28 Fast reduction modulo $p_{224} = 2^{224} - 2^{96} + 1$
INPUT: An integer c = (c_13, ..., c_2, c_1, c_0) in base $2^{32}$ with $0 \le c < p_{224}^2$.
OUTPUT: c mod $p_{224}$.
1. Define 224-bit integers:
   s_1 = (c_6, c_5, c_4, c_3, c_2, c_1, c_0), s_2 = (c_10, c_9, c_8, c_7, 0, 0, 0),
   s_3 = (0, c_13, c_12, c_11, 0, 0, 0), s_4 = (c_13, c_12, c_11, c_10, c_9, c_8, c_7),
   s_5 = (0, 0, 0, 0, c_13, c_12, c_11).
2. Return(s_1 + s_2 + s_3 − s_4 − s_5 mod $p_{224}$).


Algorithm 2.29 Fast reduction modulo $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$
INPUT: An integer c = (c_15, ..., c_2, c_1, c_0) in base $2^{32}$ with $0 \le c < p_{256}^2$.
OUTPUT: c mod $p_{256}$.
1. Define 256-bit integers:
   s_1 = (c_7, c_6, c_5, c_4, c_3, c_2, c_1, c_0),
   s_2 = (c_15, c_14, c_13, c_12, c_11, 0, 0, 0),
   s_3 = (0, c_15, c_14, c_13, c_12, 0, 0, 0),
   s_4 = (c_15, c_14, 0, 0, 0, c_10, c_9, c_8),
   s_5 = (c_8, c_13, c_15, c_14, c_13, c_11, c_10, c_9),
   s_6 = (c_10, c_8, 0, 0, 0, c_13, c_12, c_11),
   s_7 = (c_11, c_9, 0, 0, c_15, c_14, c_13, c_12),
   s_8 = (c_12, 0, c_10, c_9, c_8, c_15, c_14, c_13),
   s_9 = (c_13, 0, c_11, c_10, c_9, 0, c_15, c_14).
2. Return(s_1 + 2s_2 + 2s_3 + s_4 + s_5 − s_6 − s_7 − s_8 − s_9 mod $p_{256}$).








Algorithm 2.30 Fast reduction modulo $p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$
INPUT: An integer c = (c_23, ..., c_2, c_1, c_0) in base $2^{32}$ with $0 \le c < p_{384}^2$.
OUTPUT: c mod $p_{384}$.
1. Define 384-bit integers:
   s_1 = (c_11, c_10, c_9, c_8, c_7, c_6, c_5, c_4, c_3, c_2, c_1, c_0),
   s_2 = (0, 0, 0, 0, 0, c_23, c_22, c_21, 0, 0, 0, 0),
   s_3 = (c_23, c_22, c_21, c_20, c_19, c_18, c_17, c_16, c_15, c_14, c_13, c_12),
   s_4 = (c_20, c_19, c_18, c_17, c_16, c_15, c_14, c_13, c_12, c_23, c_22, c_21),
   s_5 = (c_19, c_18, c_17, c_16, c_15, c_14, c_13, c_12, c_20, 0, c_23, 0),
   s_6 = (0, 0, 0, 0, c_23, c_22, c_21, c_20, 0, 0, 0, 0),
   s_7 = (0, 0, 0, 0, 0, 0, c_23, c_22, c_21, 0, 0, c_20),
   s_8 = (c_22, c_21, c_20, c_19, c_18, c_17, c_16, c_15, c_14, c_13, c_12, c_23),
   s_9 = (0, 0, 0, 0, 0, 0, 0, c_23, c_22, c_21, c_20, 0),
   s_10 = (0, 0, 0, 0, 0, 0, 0, c_23, c_23, 0, 0, 0).
2. Return(s_1 + 2s_2 + s_3 + s_4 + s_5 + s_6 + s_7 − s_8 − s_9 − s_10 mod $p_{384}$).





Algorithm 2.31 Fast reduction modulo $p_{521} = 2^{521} - 1$
INPUT: An integer c = (c_1041, ..., c_2, c_1, c_0) in base 2 with $0 \le c < p_{521}^2$.
OUTPUT: c mod $p_{521}$.
1. Define 521-bit integers:
   s_1 = (c_1041, ..., c_523, c_522, c_521),
   s_2 = (c_520, ..., c_2, c_1, c_0).
2. Return(s_1 + s_2 mod $p_{521}$).


2.3 Binary field arithmetic


This section presents algorithms that are suitable for performing binary field arith- 
metic in software. Chapter 5 includes additional material on use of single-instruction 
multiple-data (SIMD) registers found on some processors (§5.1.3), and on design con- 
siderations for hardware implementation (§5.2.2). Selected timings for field operations 
appear in §5.1.5. 

We assume that the implementation platform has a W-bit architecture where W is 
a multiple of 8. The bits of a W-bit word U are numbered from 0 to W — 1, with the 
rightmost bit of U designated as bit 0. The following standard notation is used to denote 
operations on words U and V: 


    U ⊕ V    bitwise exclusive-or
    U & V    bitwise AND
    U ≫ i    right shift of U by i positions with the i high-order bits set to 0
    U ≪ i    left shift of U by i positions with the i low-order bits set to 0.


Let f(z) be an irreducible binary polynomial of degree m, and write
$f(z) = z^m + r(z)$. The elements of $\mathbb{F}_{2^m}$ are the binary polynomials of degree at most m − 1.
Addition of field elements is the usual addition of binary polynomials. Multiplication is
performed modulo f(z). A field element $a(z) = a_{m-1}z^{m-1} + \cdots + a_2z^2 + a_1z + a_0$ is
associated with the binary vector $a = (a_{m-1}, \ldots, a_2, a_1, a_0)$ of length m. Let $t = \lceil m/W \rceil$,
and let s = Wt − m. In software, a may be stored in an array of t W-bit words:
A = (A[t − 1], ..., A[2], A[1], A[0]), where the rightmost bit of A[0] is $a_0$, and the
leftmost s bits of A[t − 1] are unused (always set to 0).


    A[t − 1]:  (s unused bits) a_{m−1} ··· a_{(t−1)W}
    ···
    A[1]:      a_{2W−1} ··· a_{W+1} a_W
    A[0]:      a_{W−1} ··· a_1 a_0


Figure 2.4. Representation of $a \in \mathbb{F}_{2^m}$ as an array A of W-bit words. The s = tW − m highest
order bits of A[t − 1] remain unused.


2.3.1 Addition 


Addition of field elements is performed bitwise, thus requiring only t word operations. 


Algorithm 2.32 Addition in $\mathbb{F}_{2^m}$
INPUT: Binary polynomials a(z) and b(z) of degrees at most m − 1.
OUTPUT: c(z) = a(z) + b(z).
1. For i from 0 to t − 1 do
   1.1 C[i] ← A[i] ⊕ B[i].
2. Return(c).


2.3.2 Multiplication


The shift-and-add method (Algorithm 2.33) for field multiplication is based on the 
observation that 


    $a(z) \cdot b(z) = a_{m-1}z^{m-1}b(z) + \cdots + a_2z^2b(z) + a_1zb(z) + a_0b(z)$.


Iteration i in the algorithm computes $z^ib(z) \bmod f(z)$ and adds the result to the
accumulator c if $a_i = 1$. If $b(z) = b_{m-1}z^{m-1} + \cdots + b_2z^2 + b_1z + b_0$, then

    $b(z) \cdot z = b_{m-1}z^m + b_{m-2}z^{m-1} + \cdots + b_2z^3 + b_1z^2 + b_0z$
    $\equiv b_{m-1}r(z) + (b_{m-2}z^{m-1} + \cdots + b_2z^3 + b_1z^2 + b_0z) \pmod{f(z)}$.

Thus $b(z) \cdot z \bmod f(z)$ can be computed by a left-shift of the vector representation of
b(z), followed by addition of r(z) to b(z) if the high order bit $b_{m-1}$ is 1.


Algorithm 2.33 Right-to-left shift-and-add field multiplication in $\mathbb{F}_{2^m}$
INPUT: Binary polynomials a(z) and b(z) of degree at most m − 1.
OUTPUT: c(z) = a(z) · b(z) mod f(z).
1. If a_0 = 1 then c ← b; else c ← 0.
2. For i from 1 to m − 1 do
   2.1 b ← b · z mod f(z).
   2.2 If a_i = 1 then c ← c + b.
3. Return(c).
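
Algorithm 2.33 can be sketched in Python by storing a binary polynomial as an integer whose bit i is the coefficient of $z^i$ (an assumption of this sketch; the text uses arrays of W-bit words). The parameter `f` holds the full reduction polynomial including the $z^m$ term, and the function name is illustrative.

```python
def gf2m_mul_shift_add(a, b, f, m):
    """Algorithm 2.33: c = a*b mod f in F_2^m, polynomials stored as ints."""
    c = b if a & 1 else 0
    for i in range(1, m):
        b <<= 1                  # b <- b * z
        if (b >> m) & 1:         # degree reached m: add f = z^m + r(z)
            b ^= f
        if (a >> i) & 1:
            c ^= b               # c <- c + b
    return c
```

As a check, in the field $\mathbb{F}_{2^8}$ with $f(z) = z^8 + z^4 + z^3 + z + 1$ (written 0x11B), `gf2m_mul_shift_add(0x57, 0x83, 0x11B, 8)` evaluates to 0xC1.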


While Algorithm 2.33 is well-suited for hardware where a vector shift can be performed
in one clock cycle, the large number of word shifts makes it less desirable
for software implementation. We next consider faster methods for field multiplication
which first multiply the field elements as polynomials (§2.3.3 and §2.3.4), and then 
reduce the result modulo f(z) (§2.3.5). 


2.3.3 Polynomial multiplication 


The right-to-left comb method (Algorithm 2.34) for polynomial multiplication is based
on the observation that if $b(z) \cdot z^k$ has been computed for some $k \in [0, W - 1]$, then
$b(z) \cdot z^{Wj+k}$ can be easily obtained by appending j zero words to the right of the vector
representation of $b(z) \cdot z^k$. Algorithm 2.34 processes the bits of the words of A from
right to left, as shown in Figure 2.5 when the parameters are m = 163, W = 32. The
following notation is used: if C = (C[n], ..., C[2], C[1], C[0]) is an array, then C{j}
denotes the truncated array (C[n], ..., C[j + 1], C[j]).


    A[0] | a_31  | ··· | a_2   | a_1   | a_0   |
    A[1] | a_63  | ··· | a_34  | a_33  | a_32  |
    A[2] | a_95  | ··· | a_66  | a_65  | a_64  |
    A[3] | a_127 | ··· | a_98  | a_97  | a_96  |
    A[4] | a_159 | ··· | a_130 | a_129 | a_128 |
    A[5] |       |     | a_162 | a_161 | a_160 |


Figure 2.5. The right-to-left comb method (Algorithm 2.34) processes the columns of the
exponent array for a right-to-left. The bits in a column are processed from top to bottom. Example
parameters are W = 32 and m = 163.





Algorithm 2.34 Right-to-left comb method for polynomial multiplication
INPUT: Binary polynomials a(z) and b(z) of degree at most m − 1.
OUTPUT: c(z) = a(z) · b(z).
1. C ← 0.
2. For k from 0 to W − 1 do
   2.1 For j from 0 to t − 1 do
       If the kth bit of A[j] is 1 then add B to C{j}.
   2.2 If k ≠ (W − 1) then B ← B · z.
3. Return(C).
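
A word-array sketch of Algorithm 2.34 in Python is given below, assuming W = 32 and inputs supplied as integers (bit i is the coefficient of $z^i$) that are split into word lists internally. The helper and function names are illustrative, and no attempt is made at the optimizations a production implementation would use.

```python
W = 32
MASK = (1 << W) - 1


def comb_rtl_mul(a, b, m):
    """Algorithm 2.34: polynomial product a*b (no reduction), W = 32."""
    t = -(-m // W)                                   # t = ceil(m / W)
    A = [(a >> (W * i)) & MASK for i in range(t)]
    B = [(b >> (W * i)) & MASK for i in range(t + 1)]  # B can grow to t+1 words
    C = [0] * (2 * t)
    for k in range(W):
        for j in range(t):
            if (A[j] >> k) & 1:
                # Add B to C{j}: XOR B into C starting at word j.
                for i in range(t + 1):
                    C[j + i] ^= B[i]
        if k != W - 1:
            # B <- B * z: shift the whole word array left by one bit.
            carry = 0
            for i in range(t + 1):
                top = B[i] >> (W - 1)
                B[i] = ((B[i] << 1) & MASK) | carry
                carry = top
    return sum(word << (W * i) for i, word in enumerate(C))
```

Only W − 1 vector shifts of B are performed in total, compared with m − 1 shifts in the shift-and-add method.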


The left-to-right comb method for polynomial multiplication processes the bits of a
from left to right as follows:

    $a(z) \cdot b(z) = (\cdots((a_{m-1}b(z)z + a_{m-2}b(z))z + a_{m-3}b(z))z + \cdots + a_1b(z))z + a_0b(z)$.

Algorithm 2.35 is a modification of this method where the bits of the words of A are
processed from left to right. This is illustrated in Figure 2.6 when m = 163, W = 32
are the parameters.


    | a_31  | ··· | a_2   | a_1   | a_0   | A[0]
    | a_63  | ··· | a_34  | a_33  | a_32  | A[1]
    | a_95  | ··· | a_66  | a_65  | a_64  | A[2]
    | a_127 | ··· | a_98  | a_97  | a_96  | A[3]
    | a_159 | ··· | a_130 | a_129 | a_128 | A[4]
    |       |     | a_162 | a_161 | a_160 | A[5]


Figure 2.6. The left-to-right comb method (Algorithm 2.35) processes the columns of the
exponent array for a left-to-right. The bits in a column are processed from top to bottom. Example
parameters are W = 32 and m = 163.


Algorithm 2.35 Left-to-right comb method for polynomial multiplication
INPUT: Binary polynomials a(z) and b(z) of degree at most m − 1.
OUTPUT: c(z) = a(z) · b(z).
1. C ← 0.
2. For k from W − 1 downto 0 do
   2.1 For j from 0 to t − 1 do
       If the kth bit of A[j] is 1 then add B to C{j}.
   2.2 If k ≠ 0 then C ← C · z.
3. Return(C).


Algorithms 2.34 and 2.35 are both faster than Algorithm 2.33 since there are fewer 
vector shifts (multiplications by z). Algorithm 2.34 is faster than Algorithm 2.35 since 
the vector shifts in the former involve the t-word array B (which can grow to size t+ 1), 
while the vector shifts in the latter involve the 2t-word array C. 

Algorithm 2.35 can be accelerated considerably at the expense of some storage overhead
by first computing u(z) · b(z) for all polynomials u(z) of degree less than w, and
then processing the bits of A[j] w at a time. The modified method is presented as
Algorithm 2.36. The order in which the bits of a are processed is shown in Figure 2.7
when the parameters are m = 163, W = 32, w = 4.


Algorithm 2.36 Left-to-right comb method with windows of width w
INPUT: Binary polynomials a(z) and b(z) of degree at most m − 1.
OUTPUT: c(z) = a(z) · b(z).
1. Compute B_u = u(z) · b(z) for all polynomials u(z) of degree at most w − 1.
2. C ← 0.
3. For k from (W/w) − 1 downto 0 do
   3.1 For j from 0 to t − 1 do
       Let u = (u_{w−1}, ..., u_1, u_0), where u_i is bit (wk + i) of A[j].
       Add B_u to C{j}.
   3.2 If k ≠ 0 then C ← C · z^w.
4. Return(C).


As written, Algorithm 2.36 performs polynomial multiplication—modular reduction 
for field multiplication is performed separately. In some situations, it may be advanta- 
geous to include the reduction polynomial f as an input to the algorithm. Step 1 may 
then be modified to calculate ub mod f, which may allow optimizations in step 3. 
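
The windowed method of Algorithm 2.36 can be sketched as follows. For brevity this Python version keeps the accumulator C as a single integer instead of a 2t-word array, so "add $B_u$ to C{j}" becomes an XOR of the table entry shifted by Wj bits. It assumes w divides W; the function name and default parameters are illustrative.

```python
def comb_ltr_window_mul(a, b, m, W=32, w=4):
    """Algorithm 2.36: polynomial product a*b using width-w windows."""
    t = -(-m // W)                                  # t = ceil(m / W)
    word_mask = (1 << W) - 1
    A = [(a >> (W * j)) & word_mask for j in range(t)]
    # Step 1: table[u] = u(z) * b(z) for every u of degree at most w - 1.
    table = [0] * (1 << w)
    for u in range(1, 1 << w):
        low = u & -u                                # lowest set bit of u
        table[u] = table[u ^ low] ^ (b * low)       # b * 2^i is b shifted by i
    C = 0
    for k in range(W // w - 1, -1, -1):
        for j in range(t):
            u = (A[j] >> (w * k)) & ((1 << w) - 1)  # bits wk .. wk+w-1 of A[j]
            C ^= table[u] << (W * j)                # add B_u to C{j}
        if k != 0:
            C <<= w                                 # C <- C * z^w
    return C
```

With w = 4 the table holds 16 entries and the outer loop runs W/w = 8 times, so only seven multi-word shifts of C are needed for W = 32.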


Note 2.37 (enhancements to Algorithm 2.36) Depending on processor characteristics,
one potentially useful variation of Algorithm 2.36 exchanges shifts for additions and
table lookups. Precomputation is split into l tables; for simplicity, we assume l | w.
Table i, 0 ≤ i < l, consists of values $B_{v,i} = v(z)z^{iw/l}b(z)$ for all polynomials v of degree
less than w/l. Step 3.1 of Algorithm 2.36 is modified to calculate $B_u = \sum_{i=0}^{l-1} B_{u^i,i}$
where $u = (u_{w-1}, \ldots, u_0) = (u^{l-1}, \ldots, u^0)$ and $u^i$ has w/l bits. As an example,
Algorithm 2.36 with w = 4 has 16 elements of precomputation. The modified algorithm
with parameters w = 8 and l = 4 has the same amount of precomputation (four tables
of four points each). Compared with the original algorithm, there are fewer iterations
at step 3 (and hence fewer shifts at step 3.2); however, step 3.1 is more expensive.


Figure 2.7. Algorithm 2.36 processes columns of the exponent array for a left-to-right. The
entries within a width w column are processed from top to bottom. Example parameters are
W = 32, m = 163, and w = 4.


The comb methods are due to López and Dahab, and are based on the observation
that the exponentiation methods of Lim and Lee can be adapted for use in binary fields.
§3.3.2 discusses Lim-Lee methods in more detail in the context of elliptic curve point
multiplication; see Note 3.47.


Karatsuba-Ofman multiplication 


The divide-and-conquer method of Karatsuba-Ofman outlined in §2.2.2 can be directly 
adapted for the polynomial case. For example, 


    $a(z)b(z) = (A_1z^l + A_0)(B_1z^l + B_0)$
    $= A_1B_1z^{2l} + [(A_1 + A_0)(B_1 + B_0) + A_1B_1 + A_0B_0]z^l + A_0B_0$


where $l = \lceil m/2 \rceil$ and the coefficients $A_0$, $A_1$, $B_0$, $B_1$ are binary polynomials in z of
degree less than l. The process may be repeated, using table-lookup or other methods
at some threshold. The overhead, however, is often sufficient to render such strategies
inferior to Algorithm 2.36 for m of practical interest.
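
The recursion can be sketched in Python as shown below, again with polynomials stored as integers. The base-case routine `clmul` (a plain shift-and-XOR product) and the `threshold` parameter are assumptions of this sketch, standing in for the "table-lookup or other methods" mentioned above.

```python
def clmul(a, b):
    """Schoolbook product of binary polynomials stored as ints."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    return r


def karatsuba_gf2(a, b, m, threshold=32):
    """Karatsuba-Ofman product of binary polynomials of degree < m.

    Assumes threshold >= 1 so that the recursion terminates.
    """
    if m <= threshold:
        return clmul(a, b)
    l = (m + 1) // 2                     # l = ceil(m / 2)
    mask = (1 << l) - 1
    a0, a1 = a & mask, a >> l            # a = A1*z^l + A0
    b0, b1 = b & mask, b >> l            # b = B1*z^l + B0
    p0 = karatsuba_gf2(a0, b0, l, threshold)            # A0*B0
    p2 = karatsuba_gf2(a1, b1, l, threshold)            # A1*B1
    p1 = karatsuba_gf2(a0 ^ a1, b0 ^ b1, l, threshold)  # (A1+A0)(B1+B0)
    return (p2 << (2 * l)) ^ ((p1 ^ p2 ^ p0) << l) ^ p0
```

Each level replaces four half-size products by three, at the cost of a few extra additions (XORs), which is the overhead referred to above.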


Note 2.38 (implementing polynomial multiplication) Algorithm 2.36 appears to be 
among the fastest in practice for binary fields of interest in elliptic curve methods, 
provided that the hardware characteristics are targeted reasonably accurately. The code 
produced by various C compilers can differ dramatically in performance, and compilers 
can be sensitive to the precise form in which the algorithm is written. 




The contribution by Sun Microsystems Laboratories (SML) to the OpenSSL project 
in 2002 provides a case study of the compromises chosen in practice. OpenSSL is 
widely used to provide cryptographic services for the Apache web server and the 
OpenSSH secure shell communication tool. SML’s contribution must be understood in 
context: OpenSSL is a public and collaborative effort—it is likely that Sun’s proprietary 
code has significant enhancements. 

To keep the code size relatively small, SML implemented a fairly generic polynomial 
multiplication method. Karatsuba-Ofman is used, but only on multiplication of 2-word 
quantities rather than recursive application. At the lowest level of multiplication of 
1-word quantities, a simplified Algorithm 2.36 is applied (with w = 2, w = 3, and 
w =4 on 16-bit, 32-bit, and 64-bit platforms, respectively). As expected, the result 
tends to be much slower than the fastest versions of Algorithm 2.36. In our tests on Sun 
SPARC and Intel P6-family hardware, the Karatsuba-Ofman method implemented is 
less efficient than use of Algorithm 2.36 at the 2-word stage. However, the contribution 
from SML may be a better compromise in OpenSSL if the same code is used across 
platforms and compilers. 


2.3.4 Polynomial squaring 


Since squaring a binary polynomial is a linear operation, it is much faster than
multiplying two arbitrary polynomials; i.e., if $a(z) = a_{m-1}z^{m-1} + \cdots + a_2z^2 + a_1z + a_0$,
then

    $a(z)^2 = a_{m-1}z^{2m-2} + \cdots + a_2z^4 + a_1z^2 + a_0$.

The binary representation of $a(z)^2$ is obtained by inserting a 0 bit between consecutive
bits of the binary representation of a(z) as shown in Figure 2.8. To facilitate this
process, a table T of size 512 bytes can be precomputed for converting 8-bit polynomials
into their expanded 16-bit counterparts. Algorithm 2.39 describes this procedure for
the parameter W = 32.


Algorithm 2.39 Polynomial squaring (with wordlength W = 32)
INPUT: A binary polynomial a(z) of degree at most m − 1.
OUTPUT: $c(z) = a(z)^2$.
1. Precomputation. For each byte d = (d_7, ..., d_1, d_0), compute the 16-bit quantity
   T(d) = (0, d_7, ..., 0, d_1, 0, d_0).
2. For i from 0 to t − 1 do
   2.1 Let A[i] = (u_3, u_2, u_1, u_0) where each u_j is a byte.
   2.2 C[2i] ← (T(u_1), T(u_0)), C[2i + 1] ← (T(u_3), T(u_2)).
3. Return(c).
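
Algorithm 2.39 translates almost line for line into Python. The sketch below takes the polynomial as a list of 32-bit words with A[0] least significant, as in Figure 2.4; the names `T` and `poly_square` are illustrative.

```python
# Step 1: T[d] spreads the 8 bits of d into the even positions of 16 bits.
T = [sum(((d >> i) & 1) << (2 * i) for i in range(8)) for d in range(256)]


def poly_square(A):
    """Algorithm 2.39: square a binary polynomial given as 32-bit words."""
    C = []
    for word in A:
        u0 = word & 0xFF
        u1 = (word >> 8) & 0xFF
        u2 = (word >> 16) & 0xFF
        u3 = (word >> 24) & 0xFF
        C.append((T[u1] << 16) | T[u0])   # C[2i]
        C.append((T[u3] << 16) | T[u2])   # C[2i + 1]
    return C
```

The table has 256 entries of 16 bits each, which accounts for the 512 bytes mentioned above. For instance, `poly_square([0x3])` returns `[0x5, 0]`, reflecting $(z + 1)^2 = z^2 + 1$.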


2.3.5 Reduction 


We now discuss techniques for reducing a binary polynomial c(z) obtained by
multiplying two binary polynomials of degree ≤ m − 1, or by squaring a binary polynomial
of degree ≤ m − 1. Such polynomials c(z) have degree at most 2m − 2.


Arbitrary reduction polynomials 


Recall that $f(z) = z^m + r(z)$, where r(z) is a binary polynomial of degree at most
m − 1. Algorithm 2.40 reduces c(z) modulo f(z) one bit at a time, starting with the
leftmost bit. It is based on the observation that

    $c(z) = c_{2m-2}z^{2m-2} + \cdots + c_mz^m + c_{m-1}z^{m-1} + \cdots + c_1z + c_0$
    $\equiv (c_{2m-2}z^{m-2} + \cdots + c_m)r(z) + c_{m-1}z^{m-1} + \cdots + c_1z + c_0 \pmod{f(z)}$.

The reduction is accelerated by precomputing the polynomials $z^kr(z)$, 0 ≤ k ≤ W − 1.
If r(z) is a low-degree polynomial, or if f(z) is a trinomial, then the space requirements
are smaller, and furthermore the additions involving $z^kr(z)$ in step 2.1 are faster. The
following notation is used: if C = (C[n], ..., C[2], C[1], C[0]) is an array, then C{j}
denotes the truncated array (C[n], ..., C[j + 1], C[j]).


Algorithm 2.40 Modular reduction (one bit at a time)
INPUT: A binary polynomial c(z) of degree at most 2m − 2.
OUTPUT: c(z) mod f(z).
1. Precomputation. Compute $u_k(z) = z^kr(z)$, 0 ≤ k ≤ W − 1.
2. For i from 2m − 2 downto m do
   2.1 If c_i = 1 then
       Let j = ⌊(i − m)/W⌋ and k = (i − m) − Wj.
       Add u_k(z) to C{j}.
3. Return(C[t − 1], ..., C[1], C[0]).


If f(z) is a trinomial, or a pentanomial with middle terms close to each other, then
reduction of c(z) modulo f(z) can be efficiently performed one word at a time. For
example, suppose m = 163 and W = 32 (so t = 6), and consider reducing the word
C[9] of c(z) modulo $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$. The word C[9] represents the
polynomial $c_{319}z^{319} + \cdots + c_{289}z^{289} + c_{288}z^{288}$. We have

    $z^{288} \equiv z^{132} + z^{131} + z^{128} + z^{125} \pmod{f(z)}$,
    $z^{289} \equiv z^{133} + z^{132} + z^{129} + z^{126} \pmod{f(z)}$,
    ⋮
    $z^{319} \equiv z^{163} + z^{162} + z^{159} + z^{156} \pmod{f(z)}$.

By considering the four columns on the right side of the above congruences, we see that
reduction of C[9] can be performed by adding C[9] four times to C, with the rightmost
bit of C[9] added to bits 132, 131, 128 and 125 of C; this is illustrated in Figure 2.9.

Figure 2.9. Reducing the 32-bit word C[9] modulo $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$.


NIST reduction polynomials

We next present algorithms for fast reduction modulo the following reduction
polynomials recommended by NIST in the FIPS 186-2 standard:

    $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$
    $f(z) = z^{233} + z^{74} + 1$
    $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1$
    $f(z) = z^{409} + z^{87} + 1$
    $f(z) = z^{571} + z^{10} + z^5 + z^2 + 1$.

These algorithms, which assume a wordlength W = 32, are based on ideas similar to
those leading to Figure 2.9. They are faster than Algorithm 2.40 and furthermore have
no storage overhead.


Algorithm 2.41 Fast reduction modulo $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$ (with W = 32)
INPUT: A binary polynomial c(z) of degree at most 324.
OUTPUT: c(z) mod f(z).
1. For i from 10 downto 6 do {Reduce $C[i]z^{32i}$ modulo f(z)}
   1.1 T ← C[i].
   1.2 C[i − 6] ← C[i − 6] ⊕ (T ≪ 29).
   1.3 C[i − 5] ← C[i − 5] ⊕ (T ≪ 4) ⊕ (T ≪ 3) ⊕ T ⊕ (T ≫ 3).
   1.4 C[i − 4] ← C[i − 4] ⊕ (T ≫ 28) ⊕ (T ≫ 29).
2. T ← C[5] ≫ 3. {Extract bits 3–31 of C[5]}
3. C[0] ← C[0] ⊕ (T ≪ 7) ⊕ (T ≪ 6) ⊕ (T ≪ 3) ⊕ T.
4. C[1] ← C[1] ⊕ (T ≫ 25) ⊕ (T ≫ 26).
5. C[5] ← C[5] & 0x7. {Clear the reduced bits of C[5]}
6. Return (C[5], C[4], C[3], C[2], C[1], C[0]).
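
To illustrate the word-at-a-time idea, the sketch below implements Algorithm 2.41 in Python on a list of eleven 32-bit words, together with a bit-at-a-time reference reduction in the spirit of Algorithm 2.40 (without the precomputed table) that can be used to cross-check it. Explicit masking with `MASK` emulates 32-bit registers, since Python integers do not overflow; all names are illustrative.

```python
MASK = 0xFFFFFFFF
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def reduce_generic(c, f, m):
    """Bit-at-a-time reduction of the int c modulo f, where deg f = m."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)
    return c


def reduce_163(C):
    """Algorithm 2.41 on eleven 32-bit words, C[0] least significant."""
    C = list(C)                       # work on a copy
    for i in range(10, 5, -1):        # reduce C[i] * z^(32i)
        T = C[i]
        C[i - 6] ^= (T << 29) & MASK
        C[i - 5] ^= ((T << 4) & MASK) ^ ((T << 3) & MASK) ^ T ^ (T >> 3)
        C[i - 4] ^= (T >> 28) ^ (T >> 29)
    T = C[5] >> 3                     # bits 3-31 of C[5]
    C[0] ^= ((T << 7) & MASK) ^ ((T << 6) & MASK) ^ ((T << 3) & MASK) ^ T
    C[1] ^= (T >> 25) ^ (T >> 26)
    C[5] &= 0x7                       # clear the reduced bits of C[5]
    return C[:6]
```

The shift amounts follow from 32i − 163 = 32(i − 6) + 29: multiplying by $z^{29}$, $z^{32}$, $z^{35}$ and $z^{36}$ spreads each word T over at most three neighbouring words.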


Algorithm 2.42 Fast reduction modulo $f(z) = z^{233} + z^{74} + 1$ (with W = 32)
INPUT: A binary polynomial c(z) of degree at most 464.
OUTPUT: c(z) mod f(z).
1. For i from 15 downto 8 do {Reduce $C[i]z^{32i}$ modulo f(z)}
   1.1 T ← C[i].
   1.2 C[i − 8] ← C[i − 8] ⊕ (T ≪ 23).
   1.3 C[i − 7] ← C[i − 7] ⊕ (T ≫ 9).
   1.4 C[i − 5] ← C[i − 5] ⊕ (T ≪ 1).
   1.5 C[i − 4] ← C[i − 4] ⊕ (T ≫ 31).
2. T ← C[7] ≫ 9. {Extract bits 9–31 of C[7]}
3. C[0] ← C[0] ⊕ T.
4. C[2] ← C[2] ⊕ (T ≪ 10).
5. C[3] ← C[3] ⊕ (T ≫ 22).
6. C[7] ← C[7] & 0x1FF. {Clear the reduced bits of C[7]}
7. Return (C[7], C[6], C[5], C[4], C[3], C[2], C[1], C[0]).


Algorithm 2.43 Fast reduction modulo $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1$ (with W = 32)
INPUT: A binary polynomial c(z) of degree at most 564.
OUTPUT: c(z) mod f(z).
1. For i from 17 downto 9 do {Reduce $C[i]z^{32i}$ modulo f(z)}
   1.1 T ← C[i].
   1.2 C[i − 9] ← C[i − 9] ⊕ (T ≪ 5) ⊕ (T ≪ 10) ⊕ (T ≪ 12) ⊕ (T ≪ 17).
   1.3 C[i − 8] ← C[i − 8] ⊕ (T ≫ 27) ⊕ (T ≫ 22) ⊕ (T ≫ 20) ⊕ (T ≫ 15).
2. T ← C[8] ≫ 27. {Extract bits 27–31 of C[8]}
3. C[0] ← C[0] ⊕ T ⊕ (T ≪ 5) ⊕ (T ≪ 7) ⊕ (T ≪ 12).
4. C[8] ← C[8] & 0x7FFFFFF. {Clear the reduced bits of C[8]}
5. Return (C[8], C[7], C[6], C[5], C[4], C[3], C[2], C[1], C[0]).


Algorithm 2.44 Fast reduction modulo $f(z) = z^{409} + z^{87} + 1$ (with W = 32)
INPUT: A binary polynomial c(z) of degree at most 816.
OUTPUT: c(z) mod f(z).
1. For i from 25 downto 13 do {Reduce $C[i]z^{32i}$ modulo f(z)}
   1.1 T ← C[i].
   1.2 C[i − 13] ← C[i − 13] ⊕ (T ≪ 7).
   1.3 C[i − 12] ← C[i − 12] ⊕ (T ≫ 25).
   1.4 C[i − 11] ← C[i − 11] ⊕ (T ≪ 30).
   1.5 C[i − 10] ← C[i − 10] ⊕ (T ≫ 2).
2. T ← C[12] ≫ 25. {Extract bits 25–31 of C[12]}
3. C[0] ← C[0] ⊕ T.
4. C[2] ← C[2] ⊕ (T ≪ 23).
5. C[12] ← C[12] & 0x1FFFFFF. {Clear the reduced bits of C[12]}
6. Return (C[12], C[11], ..., C[1], C[0]).


Algorithm 2.45 Fast reduction modulo $f(z) = z^{571} + z^{10} + z^5 + z^2 + 1$ (with W = 32)
INPUT: A binary polynomial c(z) of degree at most 1140.
OUTPUT: c(z) mod f(z).
1. For i from 35 downto 18 do {Reduce $C[i]z^{32i}$ modulo f(z)}
   1.1 T ← C[i].
   1.2 C[i − 18] ← C[i − 18] ⊕ (T ≪ 5) ⊕ (T ≪ 7) ⊕ (T ≪ 10) ⊕ (T ≪ 15).
   1.3 C[i − 17] ← C[i − 17] ⊕ (T ≫ 27) ⊕ (T ≫ 25) ⊕ (T ≫ 22) ⊕ (T ≫ 17).
2. T ← C[17] ≫ 27. {Extract bits 27–31 of C[17]}
3. C[0] ← C[0] ⊕ T ⊕ (T ≪ 2) ⊕ (T ≪ 5) ⊕ (T ≪ 10).
4. C[17] ← C[17] & 0x7FFFFFF. {Clear the reduced bits of C[17]}
5. Return (C[17], C[16], ..., C[1], C[0]).


2.3.6 Inversion and division


In this subsection, we simplify the notation and denote binary polynomials a(z) by a.
Recall that the inverse of a nonzero element $a \in \mathbb{F}_{2^m}$ is the unique element $g \in \mathbb{F}_{2^m}$ such
that ag = 1 in $\mathbb{F}_{2^m}$, that is, $ag \equiv 1 \pmod f$. This inverse element is denoted $a^{-1} \bmod f$
or simply $a^{-1}$ if the reduction polynomial f is understood from context. Inverses
can be efficiently computed by the extended Euclidean algorithm for polynomials.


The extended Euclidean algorithm for polynomials 


Let a and b be binary polynomials, not both 0. The greatest common divisor (gcd) of a 
and b, denoted gcd(a, b), is the binary polynomial d of highest degree that divides both 
a and b. Efficient algorithms for computing gcd(a, b) exploit the following polynomial 
analogue of Theorem 2.18. 


Theorem 2.46 Let a and b be binary polynomials. Then gcd(a, b) = gcd(b — ca, a) 
for all binary polynomials c. 


In the classical Euclidean algorithm for computing the gcd of binary polynomials a
and b, where deg(b) ≥ deg(a), b is divided by a to obtain a quotient q and a remainder
r satisfying b = qa + r and deg(r) < deg(a). By Theorem 2.46, gcd(a, b) = gcd(r, a).
Thus, the problem of determining gcd(a, b) is reduced to that of computing gcd(r, a)
where the arguments (r, a) have lower degrees than the degrees of the original
arguments (a, b). This process is repeated until one of the arguments is zero; the result is
then immediately obtained since gcd(0, d) = d. The algorithm must terminate since the
degrees of the remainders are strictly decreasing. Moreover, it is efficient because the
number of (long) divisions is at most k where k = deg(a).

In a variant of the classical Euclidean algorithm, only one step of each long division
is performed. That is, if deg(b) ≥ deg(a) and j = deg(b) − deg(a), then one computes
$r = b + z^ja$. By Theorem 2.46, gcd(a, b) = gcd(r, a). This process is repeated until a
zero remainder is encountered. Since deg(r) < deg(b), the number of (partial) division
steps is at most 2k where k = max{deg(a), deg(b)}.

The Euclidean algorithm can be extended to find binary polynomials g and h
satisfying ag + bh = d where d = gcd(a, b). Algorithm 2.47 maintains the invariants

    $ag_1 + bh_1 = u$
    $ag_2 + bh_2 = v$.

The algorithm terminates when u = 0, in which case v = gcd(a, b) and $ag_2 + bh_2 = d$.


Algorithm 2.47 Extended Euclidean algorithm for binary polynomials
INPUT: Nonzero binary polynomials a and b with deg(a) ≤ deg(b).
OUTPUT: d = gcd(a, b) and binary polynomials g, h satisfying ag + bh = d.
1. u ← a, v ← b.
2. g_1 ← 1, g_2 ← 0, h_1 ← 0, h_2 ← 1.
3. While u ≠ 0 do
   3.1 j ← deg(u) − deg(v).
   3.2 If j < 0 then: u ↔ v, g_1 ↔ g_2, h_1 ↔ h_2, j ← −j.
   3.3 u ← u + z^j v.
   3.4 g_1 ← g_1 + z^j g_2, h_1 ← h_1 + z^j h_2.
4. d ← v, g ← g_2, h ← h_2.
5. Return(d, g, h).
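
A compact Python sketch of Algorithm 2.47 follows, with binary polynomials stored as integers (bit i is the coefficient of $z^i$), so that addition is XOR and multiplication by $z^j$ is a left shift by j. The helper `clmul` is not part of the algorithm; it is included only so the identity ag + bh = d can be checked. All names are illustrative.

```python
def deg(x):
    """Degree of a nonzero binary polynomial stored as an int."""
    return x.bit_length() - 1


def clmul(a, b):
    """Product of binary polynomials (used only to verify the output)."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    return r


def poly_ext_gcd(a, b):
    """Algorithm 2.47: return (d, g, h) with a*g + b*h = d = gcd(a, b)."""
    u, v = a, b
    g1, g2, h1, h2 = 1, 0, 0, 1
    while u != 0:
        j = deg(u) - deg(v)
        if j < 0:
            u, v = v, u
            g1, g2 = g2, g1
            h1, h2 = h2, h1
            j = -j
        u ^= v << j          # u <- u + z^j * v
        g1 ^= g2 << j        # g1 <- g1 + z^j * g2
        h1 ^= h2 << j        # h1 <- h1 + z^j * h2
    return v, g2, h2
```

For $a = z^2 + 1$ and $b = z^3 + 1$ (integers 5 and 9), the function returns d = 3, g = 2, h = 1, that is, $\gcd = z + 1 = a \cdot z + b \cdot 1$.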








Suppose now that f is an irreducible binary polynomial of degree m and the nonzero
polynomial a has degree at most m − 1 (hence gcd(a, f) = 1). If Algorithm 2.47 is
executed with inputs a and f, the last nonzero u encountered in step 3.3 is u = 1. After
this occurrence, the polynomials $g_1$ and $h_1$, as updated in step 3.4, satisfy $ag_1 + fh_1 = 1$.
Hence $ag_1 \equiv 1 \pmod f$ and so $a^{-1} = g_1$. Note that $h_1$ and $h_2$ are not needed for the
determination of $g_1$. These observations lead to Algorithm 2.48 for inversion in $\mathbb{F}_{2^m}$.


Algorithm 2.48 Inversion in $\mathbb{F}_{2^m}$ using the extended Euclidean algorithm
INPUT: A nonzero binary polynomial a of degree at most m − 1.
OUTPUT: $a^{-1} \bmod f$.
1. u ← a, v ← f.
2. g_1 ← 1, g_2 ← 0.
3. While u ≠ 1 do
   3.1 j ← deg(u) − deg(v).
   3.2 If j < 0 then: u ↔ v, g_1 ↔ g_2, j ← −j.
   3.3 u ← u + z^j v.
   3.4 g_1 ← g_1 + z^j g_2.
4. Return(g_1).





Binary inversion algorithm


Algorithm 2.49 is the polynomial analogue of the binary algorithm for inversion in
$\mathbb{F}_p$ (Algorithm 2.22). In contrast to Algorithm 2.48 where the bits of u and v are
cleared from left to right (high degree terms to low degree terms), the bits of u and
v in Algorithm 2.49 are cleared from right to left.


Algorithm 2.49 Binary algorithm for inversion in $\mathbb{F}_{2^m}$
INPUT: A nonzero binary polynomial a of degree at most m − 1.
OUTPUT: $a^{-1} \bmod f$.
1. u ← a, v ← f.
2. g_1 ← 1, g_2 ← 0.
3. While (u ≠ 1 and v ≠ 1) do
   3.1 While z divides u do
       u ← u/z.
       If z divides g_1 then g_1 ← g_1/z; else g_1 ← (g_1 + f)/z.
   3.2 While z divides v do
       v ← v/z.
       If z divides g_2 then g_2 ← g_2/z; else g_2 ← (g_2 + f)/z.
   3.3 If deg(u) > deg(v) then: u ← u + v, g_1 ← g_1 + g_2;
       Else: v ← v + u, g_2 ← g_2 + g_1.
4. If u = 1 then return(g_1); else return(g_2).


The expression involving degree calculations in step 3.3 may be replaced by a simpler
comparison on the binary representations of the polynomials. This differs from
Algorithm 2.48, where explicit degree calculations are required in step 3.1.
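
Algorithm 2.49 can be sketched in Python with polynomials stored as integers, so that "z divides u" is a test of the low bit and division by z is a right shift. The degree comparison in step 3.3 is done through `bit_length`, one possible form of the simpler comparison mentioned above. The function name is illustrative, and f is assumed irreducible (hence with constant term 1).

```python
def gf2m_inv_binary(a, f):
    """Algorithm 2.49: inverse of nonzero a modulo the irreducible f."""
    u, v, g1, g2 = a, f, 1, 0
    while u != 1 and v != 1:
        while (u & 1) == 0:                       # z divides u
            u >>= 1
            g1 = (g1 >> 1) if (g1 & 1) == 0 else ((g1 ^ f) >> 1)
        while (v & 1) == 0:                       # z divides v
            v >>= 1
            g2 = (g2 >> 1) if (g2 & 1) == 0 else ((g2 ^ f) >> 1)
        if u.bit_length() > v.bit_length():       # deg(u) > deg(v)
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2
```

In $\mathbb{F}_{2^3}$ with $f(z) = z^3 + z + 1$, `gf2m_inv_binary(0b010, 0b1011)` returns `0b101`, consistent with $z(z^2 + 1) = z^3 + z \equiv 1 \pmod f$.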


Almost inverse algorithm


The almost inverse algorithm (Algorithm 2.50) is a modification of the binary inversion
algorithm (Algorithm 2.49) in which a polynomial g and a positive integer k are first
computed satisfying

    $ag \equiv z^k \pmod f$.

A reduction is then applied to obtain

    $a^{-1} = z^{-k}g \bmod f$.

The invariants maintained are

    $ag_1 + fh_1 = z^ku$
    $ag_2 + fh_2 = z^kv$

for some $h_1$, $h_2$ that are not explicitly calculated.


Algorithm 2.50 Almost Inverse Algorithm for inversion in $\mathbb{F}_{2^m}$
INPUT: A nonzero binary polynomial a of degree at most m − 1.
OUTPUT: $a^{-1} \bmod f$.
1. u ← a, v ← f.
2. g_1 ← 1, g_2 ← 0, k ← 0.
3. While (u ≠ 1 and v ≠ 1) do
   3.1 While z divides u do
       u ← u/z, g_2 ← z · g_2, k ← k + 1.
   3.2 While z divides v do
       v ← v/z, g_1 ← z · g_1, k ← k + 1.
   3.3 If deg(u) > deg(v) then: u ← u + v, g_1 ← g_1 + g_2.
       Else: v ← v + u, g_2 ← g_2 + g_1.
4. If u = 1 then g ← g_1; else g ← g_2.
5. Return($z^{-k}g \bmod f$).











The reduction in step 5 can be performed as follows. Let $l = \min\{i \geq 1 \mid f_i = 1\}$,
where $f(z) = f_m z^m + \cdots + f_1 z + f_0$. Let $S$ be the polynomial formed by the $l$ rightmost
bits of $g$. Then $Sf + g$ is divisible by $z^l$ and $T = (Sf + g)/z^l$ has degree less than $m$;
thus $T = g z^{-l} \bmod f$. This process can be repeated to finally obtain $g z^{-k} \bmod f$. The
reduction polynomial $f$ is said to be suitable if $l$ is above some threshold (which may
depend on the implementation; e.g., $l \geq W$ is desirable with $W$-bit words), since then
less effort is required in the reduction step.

Steps 3.1-3.2 are simpler than those in Algorithm 2.49. In addition, the $g_1$ and
$g_2$ appearing in these algorithms grow more slowly in almost inverse. Thus one can
expect Algorithm 2.50 to outperform Algorithm 2.49 if the reduction polynomial is
suitable, and Algorithm 2.49 to be the faster of the two otherwise. As with the binary
algorithm, the conditional involving degree calculations may be replaced with a simpler
comparison.
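A minimal Python sketch of Algorithm 2.50, including the step 5 reduction that removes up to $l$ factors of $z$ at a time, is given below. The integer encoding of polynomials (bit $i$ is the coefficient of $z^i$) and the helper names `clmul`, `gf2_mod`, and `almost_inverse` are assumptions made for this illustration; `gf2_mod` is a plain shift-and-XOR reduction used only to check results.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product of a and b over F_2, with no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2_mod(x: int, f: int) -> int:
    """Reduce x modulo f by shift-and-XOR; used here only for checking."""
    m = f.bit_length() - 1
    while x.bit_length() > m:
        x ^= f << (x.bit_length() - 1 - m)
    return x


def almost_inverse(a: int, f: int) -> int:
    """Sketch of Algorithm 2.50; assumes f irreducible and 0 <= deg(a) < deg(f)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    u, v = a, f
    g1, g2, k = 1, 0, 0
    while u != 1 and v != 1:
        while (u & 1) == 0:                   # step 3.1
            u >>= 1
            g2 <<= 1
            k += 1
        while (v & 1) == 0:                   # step 3.2
            v >>= 1
            g1 <<= 1
            k += 1
        if u.bit_length() > v.bit_length():   # step 3.3
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    g = g1 if u == 1 else g2                  # step 4: a*g = z^k (mod f)
    # Step 5: l = min{i >= 1 : f_i = 1}; strip min(k, l) factors of z per pass.
    x = f >> 1
    l = (x & -x).bit_length()
    while k > 0:
        s = min(k, l)
        S = g & ((1 << s) - 1)                # the s rightmost bits of g
        g = (g ^ clmul(S, f)) >> s            # (S*f + g) / z^s
        k -= s
    return g


F16 = 0b10011                                 # z^4 + z + 1
```

For the trinomial $z^4 + z + 1$ the value of $l$ is 1, so the reduction proceeds one bit at a time; a polynomial with larger $l$ needs proportionally fewer passes, which is the sense in which the text calls it suitable.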


Division 

The binary inversion algorithm (Algorithm 2.49) can be easily modified to perform
division $b/a = ba^{-1}$. In cases where the ratio $I/M$ of inversion to multiplication costs
is small, this could be especially significant in elliptic curve schemes, since an elliptic
curve point operation in affine coordinates (see §3.1.2) could use division rather than
an inversion and multiplication.


Division based on the binary algorithm To obtain $b/a$, Algorithm 2.49 is modified
at step 2, replacing $g_1 \leftarrow 1$ with $g_1 \leftarrow b$. The associated invariants are

$$a g_1 + f h_1 = u b$$
$$a g_2 + f h_2 = v b.$$

On termination with $u = 1$, it follows that $g_1 = ba^{-1}$. The division algorithm is
expected to have the same running time as the binary algorithm, since $g_1$ in Algo-
rithm 2.49 goes to full-length in a few iterations at step 3.1 (i.e., the difference in
initialization of $g_1$ does not contribute significantly to the time for division versus
inversion).
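The one-line change to the initialization can be illustrated with the following Python sketch. As before, polynomials are encoded as integers, $f$ is assumed irreducible, and `binary_divide` and `gf2_mulmod` are hypothetical names introduced only for this example.

```python
def gf2_mulmod(a: int, b: int, f: int) -> int:
    """Multiply binary polynomials a, b (deg < deg f) modulo f; used only for checking."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def binary_divide(b: int, a: int, f: int) -> int:
    """Sketch of b/a mod f: Algorithm 2.49 with g1 initialized to b instead of 1."""
    if a == 0:
        raise ZeroDivisionError("division by 0")
    u, v = a, f
    g1, g2 = b, 0                             # the only change from inversion
    while u != 1 and v != 1:
        while (u & 1) == 0:
            u >>= 1
            g1 = (g1 if (g1 & 1) == 0 else g1 ^ f) >> 1
        while (v & 1) == 0:
            v >>= 1
            g2 = (g2 if (g2 & 1) == 0 else g2 ^ f) >> 1
        if u.bit_length() > v.bit_length():
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2               # invariants give g = b/a in either case


F16 = 0b10011                                 # z^4 + z + 1
q = binary_divide(0b0111, 0b0010, F16)        # (z^2 + z + 1) / z
```

Multiplying `q` back by the divisor with `gf2_mulmod` should recover the dividend, which is a convenient self-check when experimenting with other fields.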

If the binary algorithm is the inversion method of choice, then affine point operations
would benefit from use of division, since the cost of a point double or addition changes
from $I + 2M$ to $I + M$. ($I$ and $M$ denote the time to perform an inversion and a multi-
plication, respectively.) If $I/M$ is small, then this represents a significant improvement.
For example, if $I/M$ is 3, then use of a division algorithm variant of Algorithm 2.49
reduces the cost from $5M$ to $4M$, a 20% reduction in the time to perform an affine point
double or addition. However, if $I/M > 7$, then the savings is less than 12%. Unless $I/M$
is very small, it is likely that schemes are used which reduce the number of inversions
required (e.g., halving and projective coordinates), so that point multiplication involves
relatively few field inversions, diluting any savings from use of a division algorithm.


Division based on the extended Euclidean algorithm Algorithm 2.48 can be trans-
formed to a division algorithm in a similar fashion. However, the change in the
initialization step may have significant impact on implementation of a division algo-
rithm variant. There are two performance issues: tracking of the lengths of variables,
and implementing the addition to $g_1$ at step 3.4.

In Algorithm 2.48, it is relatively easy to track the lengths of $u$ and $v$ efficiently
(the lengths shrink), and, moreover, it is also possible to track the lengths of $g_1$ and
$g_2$. However, the change in initialization for division means that $g_1$ goes to full-length
immediately, and optimizations based on shorter lengths disappear.

The second performance issue concerns the addition to $g_1$ at step 3.4. An imple-
mentation may assume that ordinary polynomial addition with no reduction may be
performed; that is, the degrees of $g_1$ and $g_2$ never exceed $m - 1$. In adapting for division,
step 3.4 may be less-efficiently implemented, since $g_1$ is full-length on initialization.


Division based on the almost inverse algorithm Although Algorithm 2.50 is similar
to the binary algorithm, the ability to efficiently track the lengths of $g_1$ and $g_2$ (in addi-
tion to the lengths of $u$ and $v$) may be an implementation advantage of Algorithm 2.50
over Algorithm 2.49 (provided that the reduction polynomial $f$ is suitable). As with
Algorithm 2.48, this advantage is lost in a division algorithm variant.

It should be noted that efficient tracking of the lengths of $g_1$ and $g_2$ (in addition to the
lengths of $u$ and $v$) in Algorithm 2.50 may involve significant code expansion (perhaps
$t^2$ fragments rather than the $t$ fragments in the binary algorithm). If the expansion
cannot be tolerated (because of application constraints or platform characteristics), then
almost inverse may not be preferable to the other inversion algorithms (even if the
reduction polynomial is suitable).
2.4 Optimal extension field arithmetic


Preceding sections discussed arithmetic for fields $\mathbb{F}_{p^m}$ in the case that $p = 2$ (binary
fields) and $m = 1$ (prime fields). As noted on page 28, the polynomial basis repre-
sentation in the binary field case can be generalized to all extension fields $\mathbb{F}_{p^m}$, with
coefficient arithmetic performed in $\mathbb{F}_p$.

For hardware implementations, binary fields are attractive since the operations in- 
volve only shifts and bitwise addition modulo 2. The simplicity is also attractive for 
software implementations on general-purpose processors; however the field multipli- 
cation is essentially a few bits at a time and can be much slower than prime field 
arithmetic if a hardware integer multiplier is available. On the other hand, the arith- 
metic in prime fields can be more difficult to implement efficiently, due in part to the 
propagation of carry bits. 

The general idea in optimal extension fields is to select p, m, and the reduction poly- 
nomial to more closely match the underlying hardware characteristics. In particular, 
the value of p may be selected to fit in a single word, simplifying the handling of carry 
(since coefficients are single-word). 


Definition 2.51 An optimal extension field (OEF) is a finite field $\mathbb{F}_{p^m}$ such that:

1. $p = 2^n - c$ for some integers $n$ and $c$ with $\log_2 |c| \leq n/2$; and

2. an irreducible polynomial $f(z) = z^m - \omega$ in $\mathbb{F}_p[z]$ exists.

If $c \in \{\pm 1\}$, then the OEF is said to be of Type I ($p$ is a Mersenne prime if $c = 1$); if
$\omega = 2$, the OEF is said to be of Type II.


Type I OEFs have especially simple arithmetic in the subfield $\mathbb{F}_p$, while Type II
OEFs allow simplifications in the $\mathbb{F}_{p^m}$ extension field arithmetic. Examples of OEFs
are given in Table 2.1.
Table 2.1. OEF example parameters. Here, $p = 2^n - c$ is prime, and $f(z) = z^m - \omega \in \mathbb{F}_p[z]$ is
irreducible over $\mathbb{F}_p$. The field is $\mathbb{F}_{p^m} = \mathbb{F}_p[z]/(f)$ of order approximately $2^{mn}$.


The following results can be used to determine if a given polynomial $f(z) = z^m - \omega$
is irreducible in $\mathbb{F}_p[z]$.

Theorem 2.52 Let $m \geq 2$ be an integer and $\omega \in \mathbb{F}_p^*$. Then the binomial $f(z) = z^m - \omega$
is irreducible in $\mathbb{F}_p[z]$ if and only if the following two conditions are satisfied:

(i) each prime factor of $m$ divides the order $e$ of $\omega$ in $\mathbb{F}_p^*$ but not $(p - 1)/e$;
(ii) $p \equiv 1 \pmod 4$ if $m \equiv 0 \pmod 4$.

If the order of $\omega$ as an element of $\mathbb{F}_p^*$ is $p - 1$, then $\omega$ is said to be primitive. It is
easily verified that conditions (i) and (ii) of Theorem 2.52 are satisfied if $\omega$ is primitive
and $m \mid (p - 1)$.


Corollary 2.53 If $\omega$ is a primitive element of $\mathbb{F}_p^*$ and $m \mid (p - 1)$, then $z^m - \omega$ is
irreducible in $\mathbb{F}_p[z]$.
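The two conditions of Theorem 2.52 translate directly into a small test. The Python sketch below is an illustration only: it assumes $p$ is prime and $\omega \not\equiv 0 \pmod p$, factors by trial division (adequate for word-sized $p$), and uses the hypothetical names `prime_factors`, `multiplicative_order`, and `binomial_is_irreducible`.

```python
def prime_factors(n: int) -> set:
    """Set of prime factors of n, by trial division."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def multiplicative_order(w: int, p: int) -> int:
    """Order e of w in F_p^*; assumes p prime and w not divisible by p."""
    e = p - 1
    for q in prime_factors(p - 1):
        while e % q == 0 and pow(w, e // q, p) == 1:
            e //= q                 # strip factors of q while w^(e/q) is still 1
    return e


def binomial_is_irreducible(p: int, m: int, w: int) -> bool:
    """Test z^m - w over F_p using conditions (i) and (ii) of Theorem 2.52."""
    e = multiplicative_order(w % p, p)
    cofactor = (p - 1) // e
    cond_i = all(e % q == 0 and cofactor % q != 0 for q in prime_factors(m))
    cond_ii = (m % 4 != 0) or (p % 4 == 1)
    return cond_i and cond_ii


# z^6 - 7 over F_p with p = 2^31 - 1 (a field used in later examples)
type_one_ok = binomial_is_irreducible(2**31 - 1, 6, 7)
# z^2 - 4 = (z - 2)(z + 2) over F_7 is reducible
reducible = binomial_is_irreducible(7, 2, 4)
```

In the second case the order of 4 in $\mathbb{F}_7^*$ is 3, which the prime factor 2 of $m$ does not divide, so condition (i) fails as expected.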


Elements of $\mathbb{F}_{p^m}$ are polynomials

$$a(z) = a_{m-1} z^{m-1} + \cdots + a_2 z^2 + a_1 z + a_0$$

where the coefficients $a_i$ are elements of $\mathbb{F}_p$. We next present algorithms for performing
arithmetic operations in OEFs. Selected timings for field operations appear in §5.1.5.


2.4.1 Addition and subtraction 
If $a(z) = \sum_{i=0}^{m-1} a_i z^i$ and $b(z) = \sum_{i=0}^{m-1} b_i z^i$ are elements of $\mathbb{F}_{p^m}$, then

$$a(z) + b(z) = \sum_{i=0}^{m-1} c_i z^i,$$

where $c_i = (a_i + b_i) \bmod p$; that is, $p$ is subtracted whenever $a_i + b_i \geq p$. Subtraction
of elements of $\mathbb{F}_{p^m}$ is done similarly.


2.4.2 Multiplication and reduction 


Multiplication of elements $a, b \in \mathbb{F}_{p^m}$ can be done by ordinary polynomial multiplica-
tion in $\mathbb{Z}[z]$ (i.e., multiplication of polynomials having integer coefficients), along with
coefficient reductions in $\mathbb{F}_p$, and a reduction by the polynomial $f$. This multiplication
takes the form

$$c(z) = a(z)b(z) = \Big(\sum_{i=0}^{m-1} a_i z^i\Big)\Big(\sum_{j=0}^{m-1} b_j z^j\Big)
= \sum_{k=0}^{2m-2} c_k z^k \equiv c_{m-1} z^{m-1} + \sum_{k=0}^{m-2} (c_k + \omega c_{k+m}) z^k \pmod{f(z)}$$

where

$$c_k = \sum_{i+j=k} a_i b_j \bmod p.$$

Karatsuba-Ofman techniques may be applied to reduce the number of $\mathbb{F}_p$ multiplica-
tions. For example,

$$a(z)b(z) = (A_1 z^l + A_0)(B_1 z^l + B_0)
= A_1 B_1 z^{2l} + [(A_1 + A_0)(B_1 + B_0) - A_1 B_1 - A_0 B_0] z^l + A_0 B_0$$

where $l = \lceil m/2 \rceil$ and the coefficients $A_0$, $A_1$, $B_0$, $B_1$ are polynomials in $\mathbb{F}_p[z]$ of
degree less than $l$. The process may be repeated, although for small values of $m$ it may
be advantageous to consider splits other than binary. The analogous case for prime
fields was discussed in §2.2.2.
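As an illustration of the addition rule of §2.4.1 and the schoolbook product with reduction by $f(z) = z^m - \omega$ described above, the following Python sketch represents a field element as a list of $m$ coefficients, lowest degree first. The representation and the names `oef_add` and `oef_mul` are assumptions made for this example; the Karatsuba-Ofman split is not shown.

```python
def oef_add(a: list, b: list, p: int) -> list:
    """Coefficient-wise addition in F_{p^m}: c_i = (a_i + b_i) mod p."""
    return [(x + y) % p for x, y in zip(a, b)]


def oef_mul(a: list, b: list, p: int, w: int) -> list:
    """Schoolbook product of a and b modulo f(z) = z^m - w, reducing every term mod p."""
    m = len(a)
    c = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c[i + j] = (c[i + j] + a[i] * b[j]) % p     # c_k = sum_{i+j=k} a_i b_j
    # Fold the upper half using z^(k+m) = w * z^k (mod f); c_{m-1} has no partner.
    return [
        (c[k] + w * c[k + m]) % p if k <= m - 2 else c[k]
        for k in range(m)
    ]


# Small check in F_7[z]/(z^2 - 3): (1 + z)^2 = 1 + 2z + z^2 = 4 + 2z
small = oef_mul([1, 1], [1, 1], 7, 3)

# In the field with p = 2^31 - 1 and f(z) = z^6 - 7: z * z^5 = z^6 = 7
P31 = 2**31 - 1
wrapped = oef_mul([0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1], P31, 7)
```

This version reduces modulo $p$ after every multiply-add, which is the simplest strategy but also the one with the most subfield reductions.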


Reduction in $\mathbb{F}_p$


The most straightforward implementation performs reductions in $\mathbb{F}_p$ for every addi-
tion and multiplication encountered during the calculation of each $c_k$. The restriction
$\log_2 |c| \leq n/2$ means that reduction in the subfield $\mathbb{F}_p$ requires only a few simple op-
erations. Algorithm 2.54 performs reduction of base-$B$ numbers, using only shifts,
additions, and single-precision multiplications.


Algorithm 2.54 Reduction modulo $M = B^n - c$
INPUT: A base $B$, positive integer $x$, and modulus $M = B^n - c$ where $c$ is an $l$-digit
base-$B$ positive integer for some $l < n$.
OUTPUT: $x \bmod M$.
1. $q_0 \leftarrow \lfloor x/B^n \rfloor$, $r_0 \leftarrow x - q_0 B^n$. {$x = q_0 B^n + r_0$ with $r_0 < B^n$}
2. $r \leftarrow r_0$, $i \leftarrow 0$.
3. While $q_i > 0$ do
   3.1 $q_{i+1} \leftarrow \lfloor q_i c/B^n \rfloor$. {$q_i c = q_{i+1} B^n + r_{i+1}$ with $r_{i+1} < B^n$}
   3.2 $r_{i+1} \leftarrow q_i c - q_{i+1} B^n$.
   3.3 $i \leftarrow i + 1$, $r \leftarrow r + r_i$.
4. While $r \geq M$ do: $r \leftarrow r - M$.
5. Return($r$).

Note 2.55 (implementation details for Algorithm 2.54)

(i) If $l \leq n/2$ and $x$ has at most $2n$ base-$B$ digits, Algorithm 2.54 executes step 3.1
at most twice (i.e., there are at most two multiplications by $c$).

(ii) As an alternative, the quotient and remainder may be folded into $x$ at each stage.
Steps 1-4 are replaced with the following.
   1. While $x \geq B^n$
      1.1 Write $x = vB^n + u$ with $u < B^n$.
      1.2 $x \leftarrow cv + u$.
   2. If $x \geq M$ then $x \leftarrow x - M$.

(iii) Algorithm 2.54 can be modified to handle the case $M = B^n + c$ for some posi-
tive integer $c < B^{n-1}$: in step 3.3, replace $r \leftarrow r + r_i$ with $r \leftarrow r + (-1)^i r_i$, and
modify step 4 to also process the case $r < 0$.
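Algorithm 2.54 and the folded variant of Note 2.55(ii) can be sketched in Python as follows. Python integers hide the multi-word arithmetic, so `divmod` by $B^n$ merely stands in for the digit shifts a real implementation would use; the function names `reduce_special` and `reduce_folded` are hypothetical.

```python
def reduce_special(x: int, B: int, n: int, c: int) -> int:
    """Sketch of Algorithm 2.54: x mod M for M = B^n - c with 0 < c < B^(n-1)."""
    Bn = B ** n
    M = Bn - c
    q, r = divmod(x, Bn)            # steps 1-2: x = q0*B^n + r0, r <- r0
    while q > 0:                    # step 3
        q, r_next = divmod(q * c, Bn)   # q_i*c = q_{i+1}*B^n + r_{i+1}
        r += r_next                 # step 3.3
    while r >= M:                   # step 4
        r -= M
    return r


def reduce_folded(x: int, B: int, n: int, c: int) -> int:
    """Sketch of Note 2.55(ii): fold the quotient back into x at each stage."""
    Bn = B ** n
    M = Bn - c
    while x >= Bn:
        v, u = divmod(x, Bn)        # x = v*B^n + u with u < B^n
        x = c * v + u               # uses B^n = c (mod M)
    if x >= M:
        x -= M
    return x


# Example with p = 2^28 - 165, i.e. B = 2, n = 28, c = 165
p = 2**28 - 165
x = (p - 1) ** 2
r1 = reduce_special(x, 2, 28, 165)
r2 = reduce_folded(x, 2, 28, 165)
```

Since $p - 1 \equiv -1 \pmod p$, both `r1` and `r2` should equal 1 for this input.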


For OEFs, Algorithm 2.54 with $B = 2$ may be applied, requiring at most two multi-
plications by $c$ in the case that $x < 2^{2n}$. When $c = 1$ (a Type I OEF) and $x \leq (p - 1)^2$,
the reduction is given by:

write $x = 2^n v + u$; $x \leftarrow v + u$; if $x \geq p$ then $x \leftarrow x - p$.

Type I OEFs are attractive in the sense that $\mathbb{F}_p$ multiplication (with reduction) can be
done with a single multiplication and a few other operations. However, the reductions
modulo $p$ are likely to contribute a significant amount to the cost of multiplication in
$\mathbb{F}_{p^m}$, and it may be more efficient to employ a direct multiply-and-accumulate strategy
to decrease the number of reductions.


Accumulation and reduction 


The number of $\mathbb{F}_p$ reductions performed in finding the product $c(z) = a(z)b(z)$ in
$\mathbb{F}_{p^m}$ can be decreased by accumulation strategies on the coefficients of $c(z)$. Since
$f(z) = z^m - \omega$, the product can be written

$$c(z) = a(z)b(z) = \sum_{k=0}^{2m-2} c_k z^k \equiv \sum_{k=0}^{m-1} c_k z^k + \omega \sum_{k=m}^{2m-2} c_k z^{k-m}$$
$$\equiv \sum_{k=0}^{m-1} \underbrace{\Big(\sum_{i=0}^{k} a_i b_{k-i} + \omega \sum_{i=k+1}^{m-1} a_i b_{k+m-i}\Big)}_{c'_k} z^k \pmod{f(z)}.$$

If the coefficient $c'_k$ is calculated as an expression in $\mathbb{Z}$ (i.e., as an integer without
reduction modulo $p$), then $c'_k \bmod p$ may be performed with a single reduction (rather
than $m$ reductions). The penalty incurred is the multiple-word operations (additions
and multiplication by $\omega$) required in accumulating the terms of $c'_k$.

In comparison with the straightforward reduce-on-every-operation strategy, it should
be noted that complete reduction on each $\mathbb{F}_p$ operation may not be necessary; for exam-
ple, it may suffice to reduce the result to a value which fits in a single word. However,
frequent reduction (to a single word or value less than $2^n$) is likely to be expensive,
especially if a “carry” or comparison must be processed.
Depending on the value of $p$, the multiply-and-accumulate strategy employs two or
three registers for the accumulation (under the assumption that $p$ fits in a register). The
arithmetic resembles that commonly used in prime-field implementations, and multipli-
cation cost in $\mathbb{F}_{p^m}$ is expected to be comparable to that in a prime field $\mathbb{F}_q$ where $q \approx p^m$
and which admits fast reduction (e.g., the NIST-recommended primes in §2.2.6).

For the reduction $c'_k \bmod p$, note that

$$c'_k \leq (p - 1)^2 + \omega(m - 1)(p - 1)^2 = (p - 1)^2 (1 + \omega(m - 1)).$$

If $p = 2^n - c$ is such that

$$\log_2(1 + \omega(m - 1)) + 2\log_2 |c| \leq n, \tag{2.4}$$

then reduction can be done with at most two multiplications by $c$. As an example, if
$p = 2^{28} - 165$ and $f(z) = z^6 - 2$, then

$$\log_2(1 + \omega(m - 1)) + 2\log_2 |c| = \log_2 11 + 2\log_2 165 \leq n = 28$$

and condition (2.4) is satisfied.
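The accumulate-then-reduce strategy and condition (2.4) can be sketched as follows. Python's unbounded integers stand in for the two- or three-register accumulator, so the sketch shows only the arithmetic, not the register-level behaviour; the names `accumulate_coefficients`, `oef_mul_accumulate`, and `condition_2_4` are hypothetical.

```python
import math


def accumulate_coefficients(a: list, b: list, w: int) -> list:
    """Unreduced integer accumulators: sum a_i b_(k-i) + w * sum a_i b_(k+m-i)."""
    m = len(a)
    out = []
    for k in range(m):
        low = sum(a[i] * b[k - i] for i in range(k + 1))
        high = sum(a[i] * b[k + m - i] for i in range(k + 1, m))
        out.append(low + w * high)
    return out


def oef_mul_accumulate(a: list, b: list, p: int, w: int) -> list:
    """Product in F_{p^m} modulo z^m - w with one reduction mod p per coefficient."""
    return [acc % p for acc in accumulate_coefficients(a, b, w)]


def condition_2_4(n: int, c: int, m: int, w: int) -> bool:
    """Check log2(1 + w(m-1)) + 2*log2|c| <= n for p = 2^n - c."""
    return math.log2(1 + w * (m - 1)) + 2 * math.log2(abs(c)) <= n


# The example from the text: p = 2^28 - 165, f(z) = z^6 - 2
p, m, w = 2**28 - 165, 6, 2
ok = condition_2_4(28, 165, m, w)

# Worst case: every coefficient equals p - 1; the k = 0 accumulator is largest
worst = max(accumulate_coefficients([p - 1] * m, [p - 1] * m, w))
```

With all coefficients equal to $p - 1$, the largest accumulator attains the bound $(p-1)^2(1 + \omega(m-1))$ stated above.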

If accumulation is in a series of registers each of size $W$ bits, then selecting $p =
2^n - c$ with $n < W$ allows several terms to be accumulated in two registers (rather
than spilling into a third register or requiring a partial reduction). The example with
$p = 2^{28} - 165$ is attractive in this sense if $W = 32$. However, this strategy competes
with optimal use of the integer multiply, and hence may not be effective if it requires
use of a larger $m$ to obtain a field of sufficient size.


Example 2.56 (accumulation strategies) Consider the OEF defined by $p = 2^{31} - 1$ and
$f(z) = z^6 - 7$, on a machine with wordsize $W = 32$. Since this is a Type I OEF, subfield
reduction is especially simple, and a combination of partial reduction with accumula-
tion may be effective in finding $c'_k \bmod p$. Although reduction into a single register
after each operation may be prohibitively expensive, an accumulation into two registers
(with some partial reductions) or into three registers can be employed.

Suppose the accumulator consists of two registers. A partial reduction may be per-
formed on each term of the form $a_i b_j$ by writing $a_i b_j = 2^{32} v + u$ and then $2v + u$ is
added to the accumulator (this is valid since $2^{32} \equiv 2 \pmod p$). Similarly, the accumulator
itself could be partially reduced after the addition of a product $a_i b_j$.

If the accumulator is three words, then the partial reductions are unnecessary, and a
portion of the accumulation involves only two registers. On the other hand, the mul-
tiplication by $\omega = 7$ and the final reduction are slightly more complicated than in the
two-register approach.


The multiply-and-accumulate strategies also apply to field squaring in $\mathbb{F}_{p^m}$. Squaring
requires a total of $m + \binom{m}{2} = m(m + 1)/2$ integer multiplications (and possibly $m - 1$
multiplications by $\omega$). The cost of the $\mathbb{F}_p$ reductions depends on the method; in partic-
ular, if only a single reduction is used in finding $c'_k$, then the number of reductions is
the same as for general multiplication.
2.4.3 Inversion


Inversion of $a \in \mathbb{F}_{p^m}$, $a \neq 0$, finds a polynomial $a^{-1} \in \mathbb{F}_{p^m}$ such that $a a^{-1} \equiv 1
\pmod f$. Variants of the Euclidean Algorithm have been proposed for use with OEFs.
However, the action of the Frobenius map along with the special form of $f$ can be used
to obtain an inversion method that is among the fastest. The method is also relatively
simple to implement efficiently once field multiplication is written, since only a few
multiplications are needed to reduce inversion in $\mathbb{F}_{p^m}$ to inversion in the subfield $\mathbb{F}_p$.
Algorithm 2.59 computes

$$a^{-1} = (a^r)^{-1} a^{r-1} \bmod f \tag{2.5}$$

where

$$r = \frac{p^m - 1}{p - 1} = p^{m-1} + \cdots + p^2 + p + 1.$$

Since $(a^r)^{p-1} = a^{p^m - 1} \equiv 1 \pmod f$, it follows that $a^r \in \mathbb{F}_p$. Hence a suitable algorithm may
be applied for inversion in $\mathbb{F}_p$ in order to compute the term $(a^r)^{-1}$ in (2.5).

Efficient calculation of $a^{r-1} = a^{p^{m-1} + \cdots + p^2 + p}$ in (2.5) is performed by using properties
of the Frobenius map $\varphi : \mathbb{F}_{p^m} \to \mathbb{F}_{p^m}$ defined by $\varphi(a) = a^p$. Elements of $\mathbb{F}_p$ are fixed
by this map. Hence, if $a = a_{m-1} z^{m-1} + \cdots + a_2 z^2 + a_1 z + a_0$, then

$$\varphi^i : a \mapsto a_{m-1} z^{(m-1)p^i} + \cdots + a_1 z^{p^i} + a_0 \bmod f.$$

To reduce the powers of $z$ modulo $f$, write a given nonnegative integer $e$ as $e = qm + r$,
where $q = \lfloor e/m \rfloor$ and $r = e \bmod m$. Since $f(z) = z^m - \omega$, it follows that

$$z^e = z^{qm + r} \equiv \omega^q z^r \pmod{f(z)}.$$

Notice that $\varphi^i(a)$ is somewhat simpler to evaluate if $p \equiv 1 \pmod m$. By Theorem 2.52,
every prime factor of $m$ divides $p - 1$. Necessarily, if $m$ is square free, the condition
$p \equiv 1 \pmod m$ holds. The results are collected in the following theorem.


Theorem 2.57 (action of Frobenius map iterates) Given an OEF with $p = 2^n - c$ and
$f(z) = z^m - \omega$, let the Frobenius map on $\mathbb{F}_{p^m}$ be given by $\varphi : a \mapsto a^p \bmod f$.

(i) The $i$th iterate of $\varphi$ is the map

$$\varphi^i : a \mapsto \sum_{j=0}^{m-1} a_j \omega^{\lfloor jp^i/m \rfloor} z^{jp^i \bmod m}.$$

(ii) If $m$ is square-free, then $p \equiv 1 \pmod m$ and hence $jp^i \bmod m = j$ for all $0 \leq
j \leq m - 1$.

The values $z^e \equiv \omega^{\lfloor e/m \rfloor} z^{e \bmod m} \pmod{f(z)}$ may be precomputed for $e = jp^i$ of
interest, in which case $\varphi^i(a)$ may be evaluated with only $m - 1$ multiplications in $\mathbb{F}_p$.
Use of an addition chain then efficiently finds $a^{r-1}$ in equation (2.5) using a few field
multiplications and applications of $\varphi^i$.
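The precomputation suggested by Theorem 2.57 can be sketched in Python as follows, with a field element stored as a list of $m$ coefficients (lowest degree first). The names `frobenius_table`, `frobenius`, `oef_mul`, and `oef_pow` are hypothetical; the last two exist only to cross-check the table-driven map against a direct computation of $a^p$ in a small field.

```python
def frobenius_table(p: int, m: int, w: int, i: int) -> list:
    """For j = 0..m-1, the pair (w^floor(j p^i / m) mod p, j p^i mod m)."""
    return [(pow(w, (j * p**i) // m, p), (j * p**i) % m) for j in range(m)]


def frobenius(a: list, table: list, p: int) -> list:
    """Apply phi^i to a using the precomputed table (one F_p product per term)."""
    out = [0] * len(a)
    for a_j, (coef, r) in zip(a, table):
        out[r] = (out[r] + a_j * coef) % p
    return out


def oef_mul(a: list, b: list, p: int, w: int) -> list:
    """Reference product modulo z^m - w, used here only for checking."""
    m = len(a)
    c = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c[i + j] += a[i] * b[j]
    return [(c[k] + (w * c[k + m] if k <= m - 2 else 0)) % p for k in range(m)]


def oef_pow(a: list, e: int, p: int, w: int) -> list:
    """Reference square-and-multiply exponentiation, used here only for checking."""
    result = [1] + [0] * (len(a) - 1)
    while e:
        if e & 1:
            result = oef_mul(result, a, p, w)
        a = oef_mul(a, a, p, w)
        e >>= 1
    return result


# Small field F_7[z]/(z^2 - 3): the Frobenius map is conjugation a0 + a1 z -> a0 - a1 z
small_table = frobenius_table(7, 2, 3, 1)
image = frobenius([2, 5], small_table, 7)

# Table for phi in the field with p = 2^31 - 1, f(z) = z^6 - 7
P31 = 2**31 - 1
table1 = frobenius_table(P31, 6, 7, 1)
```

Because $m = 6$ is square-free, every exponent in `table1` equals its index $j$, so applying $\varphi$ only rescales each coefficient.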


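Example 2.58 (calculating $a^{r-1}$) The OEF defined by $p = 2^{31} - 1$ and $f(z) = z^6 - 7$
has $r - 1 = p^5 + p^4 + \cdots + p$. We may calculate $a^{r-1}$ using the sequence indicated in
Table 2.2 (an addition-chain-like method) for $m = 6$. Evaluation of $\varphi$ and $\varphi^2$ uses the
precomputed values in Table 2.3 obtained from Theorem 2.57.


$m = 3$
   $T \leftarrow a^p$
   $T \leftarrow T a = a^{p+1}$
   $a^{r-1} \leftarrow T^p = a^{p^2 + p}$
   Cost: $1M + 2\varphi$

$m = 5$
   $T_1 \leftarrow a^p$
   $T_1 \leftarrow T_1 a = a^{p+1}$
   $T_2 \leftarrow T_1^{p^2} = a^{p^3 + p^2}$
   $T_1 \leftarrow T_1 T_2 = a^{p^3 + p^2 + p + 1}$
   $a^{r-1} \leftarrow T_1^p$
   Cost: $2M + 3\varphi$

$m = 6$
   $T_1 \leftarrow a^p$
   $T_2 \leftarrow T_1 a = a^{p+1}$
   $T_3 \leftarrow T_2^{p^2} = a^{p^3 + p^2}$
   $T_2 \leftarrow T_3 T_2 = a^{p^3 + p^2 + p + 1}$
   $T_2 \leftarrow T_2^{p^2} = a^{p^5 + p^4 + p^3 + p^2}$
   $a^{r-1} \leftarrow T_2 T_1$
   Cost: $3M + 3\varphi$

Table 2.2. Computation of $a^{r-1}$ for $r = \frac{p^m - 1}{p - 1}$, $m \in \{3, 5, 6\}$. The final row indicates the cost in
$\mathbb{F}_{p^m}$ multiplications ($M$) and applications of an iterate of the Frobenius map ($\varphi$).


$z^{jp} \equiv \omega^{\lfloor jp/m \rfloor} z^j \pmod f$          $z^{jp^2} \equiv \omega^{\lfloor jp^2/m \rfloor} z^j \pmod f$
$z^{p} \equiv 1513477736\, z$                    $z^{p^2} \equiv 1513477735\, z$
$z^{2p} \equiv 1513477735\, z^2$                 $z^{2p^2} \equiv 634005911\, z^2$
$z^{3p} \equiv 2147483646\, z^3 = -z^3$          $z^{3p^2} \equiv z^3$
$z^{4p} \equiv 634005911\, z^4$                  $z^{4p^2} \equiv 1513477735\, z^4$
$z^{5p} \equiv 634005912\, z^5$                  $z^{5p^2} \equiv 634005911\, z^5$

Table 2.3. Precomputation for evaluating $\varphi^i$, $i \in \{1, 2\}$, in the case $p = 2^{31} - 1$ and $f(z) = z^6 - 7$
(cf. Example 2.58). If $a = a_5 z^5 + \cdots + a_1 z + a_0 \in \mathbb{F}_{p^6}$, then $\varphi^i(a) = a^{p^i} \equiv \sum_{j=0}^{5} a_j \omega^{\lfloor jp^i/m \rfloor} z^j
\pmod f$.
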
In general, if $w(x)$ is the Hamming weight of the integer $x$, then $a^{r-1}$ can be
calculated with

$$t_1(m) = \lfloor \log_2(m - 1) \rfloor + w(m - 1) - 1$$

multiplications in $\mathbb{F}_{p^m}$, and

$$t_2(m) = \begin{cases}
t_1(m) + 1, & m \text{ odd}, \\
j = \lfloor \log_2(m - 1) \rfloor + 1, & m = 2^j \text{ for some } j, \\
\lfloor \log_2(m - 1) \rfloor + w(m) - 1, & \text{otherwise},
\end{cases}$$

applications of Frobenius map iterates. Since $t_2(m) \leq t_1(m) + 1$, the time for calculating
$a^{r-1}$ with $m > 2$ is dominated by the multiplications in $\mathbb{F}_{p^m}$ (each of which is much
more expensive than the $m - 1$ multiplications in $\mathbb{F}_p$ needed for evaluation of $\varphi^i$).
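The two operation counts are easy to tabulate. The short Python sketch below evaluates $t_1(m)$ and $t_2(m)$ exactly as written above; the function names are hypothetical, and `floor_log2` relies on `int.bit_length` to avoid floating-point logarithms.

```python
def hamming_weight(x: int) -> int:
    """Number of 1 bits in the binary representation of x."""
    return bin(x).count("1")


def floor_log2(x: int) -> int:
    """floor(log2(x)) for x >= 1."""
    return x.bit_length() - 1


def t1(m: int) -> int:
    """Number of F_{p^m} multiplications needed to compute a^(r-1)."""
    return floor_log2(m - 1) + hamming_weight(m - 1) - 1


def t2(m: int) -> int:
    """Number of applications of Frobenius map iterates needed to compute a^(r-1)."""
    if m % 2 == 1:
        return t1(m) + 1
    if (m & (m - 1)) == 0:                    # m is a power of two
        return floor_log2(m - 1) + 1
    return floor_log2(m - 1) + hamming_weight(m) - 1


counts = {m: (t1(m), t2(m)) for m in (3, 5, 6)}
```

For $m \in \{3, 5, 6\}$ these counts are $(1, 2)$, $(2, 3)$, and $(3, 3)$, in agreement with the costs $1M + 2\varphi$, $2M + 3\varphi$, and $3M + 3\varphi$ reported in Table 2.2.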


Algorithm 2.59 OEF inversion
INPUT: $a \in \mathbb{F}_{p^m}$, $a \neq 0$.
OUTPUT: The element $a^{-1} \in \mathbb{F}_{p^m}$ such that $a a^{-1} \equiv 1 \pmod f$.
1. Use an addition-chain approach to find $a^{r-1}$, where $r = (p^m - 1)/(p - 1)$.
2. $c \leftarrow a^r = a^{r-1} a \in \mathbb{F}_p$.
3. Obtain $c^{-1}$ such that $c c^{-1} \equiv 1 \pmod p$ via an inversion algorithm in $\mathbb{F}_p$.
4. Return($c^{-1} a^{r-1}$).


Note 2.60 (implementation details for Algorithm 2.59)

(i) The element $c$ in step 2 of Algorithm 2.59 belongs to $\mathbb{F}_p$. Hence, only arith-
metic contributing to the constant term of $a^{r-1} a$ need be performed (requiring
$m$ multiplications of elements in $\mathbb{F}_p$ and a multiplication by $\omega$).

(ii) Since $c^{-1} \in \mathbb{F}_p$, the multiplication in step 4 requires only $m$ $\mathbb{F}_p$-multiplications.

(iii) The running time is dominated by the $t_1(m)$ multiplications in $\mathbb{F}_{p^m}$ in finding
$a^{r-1}$, and the cost of the subfield inversion in step 3.
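Putting the pieces together, Algorithm 2.59 for $m = 6$ can be sketched in Python using the addition chain of Table 2.2. This is an illustration under several assumptions: field elements are coefficient lists (lowest degree first), the Frobenius powers of $\omega$ are recomputed on each call rather than taken from a precomputed table, and the subfield inversion of step 3 uses Fermat's little theorem via the built-in `pow` as a stand-in for whatever $\mathbb{F}_p$ inversion an implementation would choose. The names `oef_mul`, `frob`, and `oef_inverse_m6` are hypothetical.

```python
def oef_mul(a: list, b: list, p: int, w: int) -> list:
    """Product modulo z^m - w with one reduction mod p per output coefficient."""
    m = len(a)
    c = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c[i + j] += a[i] * b[j]
    return [(c[k] + (w * c[k + m] if k <= m - 2 else 0)) % p for k in range(m)]


def frob(a: list, i: int, p: int, w: int) -> list:
    """phi^i(a) via Theorem 2.57(i); powers of w are recomputed here for brevity."""
    m = len(a)
    out = [0] * m
    for j, a_j in enumerate(a):
        e = j * p**i
        out[e % m] = (out[e % m] + a_j * pow(w, e // m, p)) % p
    return out


def oef_inverse_m6(a: list, p: int, w: int) -> list:
    """Sketch of Algorithm 2.59 for m = 6, following the m = 6 chain of Table 2.2."""
    m = 6
    if len(a) != m or not any(x % p for x in a):
        raise ZeroDivisionError("need a nonzero element with 6 coefficients")
    t1 = frob(a, 1, p, w)                     # a^p
    t2 = oef_mul(t1, a, p, w)                 # a^(p+1)
    t3 = frob(t2, 2, p, w)                    # a^(p^3+p^2)
    t2 = oef_mul(t3, t2, p, w)                # a^(p^3+p^2+p+1)
    t2 = frob(t2, 2, p, w)                    # a^(p^5+p^4+p^3+p^2)
    ar1 = oef_mul(t2, t1, p, w)               # step 1: a^(r-1)
    # Step 2: only the constant term of a^(r-1) * a is needed (Note 2.60(i)).
    c = (ar1[0] * a[0] + w * sum(ar1[i] * a[m - i] for i in range(1, m))) % p
    c_inv = pow(c, p - 2, p)                  # step 3: stand-in subfield inversion
    return [(c_inv * x) % p for x in ar1]     # step 4: m multiplications in F_p


P31 = 2**31 - 1
a = [1, 2, 3, 4, 5, 6]                        # 6z^5 + 5z^4 + 4z^3 + 3z^2 + 2z + 1
a_inv = oef_inverse_m6(a, P31, 7)
check = oef_mul(a, a_inv, P31, 7)
```

If the chain is implemented correctly, `check` is the coefficient list of the constant polynomial 1.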

The ratio $I/M$ of field inversion cost to multiplication cost is of fundamental interest.
When $m = 6$, Algorithm 2.59 will require significantly more time than the $t_1(6) = 3$
multiplications involved in finding $a^{r-1}$, since the time for subfield inversion (step 3)
will be substantial. However, on general-purpose processors, the ratio is expected to be
much smaller than the corresponding ratio in a prime field $\mathbb{F}_q$ where $q \approx p^m$.


2.5 Notes and further references 


§2.1 

For an introduction to the theory of finite fields, see the books of Koblitz [254] and 
McEliece [311]. A more comprehensive treatment is given by Lidl and Niederreiter 
[292]. 
§2.2

Menezes, van Oorschot, and Vanstone [319] concisely cover algorithms for ordinary
and modular integer arithmetic of practical interest in cryptography. Knuth [249] is
a standard reference. Koç [258] describes several (modular) multiplication methods,
including classical and Karatsuba-Ofman, a method which interleaves multiplication
with reduction, and Montgomery multiplication.


The decision to base multiplication on operand scanning (Algorithm 2.9) or product 
scanning (Algorithm 2.10) is platform dependent. Generally speaking, Algorithm 2.9 
has more memory accesses, while Algorithm 2.10 has more complex control code 
unless loops are unrolled. Comba [101] compares the methods in detail for 16-bit Intel
80286 processors, and the unrolled product-scanning versions were apparently the
inspiration for the “comba” routines in OpenSSL.


Scott [416] discusses multiplication methods on three 32-bit Intel IA-32 processors 
(the 80486, Pentium, and Pentium Pro), and provides experimental results for mod- 
ular exponentiation with multiplication based on operand scanning, product scanning 
(Comba’s method), Karatsuba-Ofman with product scanning, and floating-point hard- 
ware. Multiplication with features introduced on newer IA-32 processors is discussed 
in §5.1.3. On the Motorola digital signal processor 56000, Dussé and Kaliski [127] 
note that extraction of U in the inner loop of Algorithm 2.9 is relatively expensive. 
The processor has a 56-bit accumulator but only signed multiplication of 24-bit quan- 
tities, and the product scanning approach in Montgomery multiplication is reportedly 
significantly faster. 


The multiplication method of Karatsuba-Ofman is due to Karatsuba and Ofman [239]. 
For integers of relatively small size, the savings in multiplications is often insufficient in 
Karatsuba-Ofman variants to make the methods competitive with optimized versions 
of classical algorithms. Knuth [249] and Koç [258] cover Karatsuba-Ofman in more
detail.


Barrett reduction (Algorithm 2.14) is due to Barrett [29]. Bosselaers, Govaerts, and 
Vandewalle [66] provide descriptions and comparative results for classical reduction 
and the reduction methods of Barrett and Montgomery. If the transformations and pre- 
computation are excluded, their results indicate that the methods are fairly similar in 
cost, with Montgomery reduction fastest and classical reduction likely to be slightly 
slower than Barrett reduction. These operation count comparisons are supported by 
implementation results on an Intel 80386 in portable C. De Win, Mister, Preneel and 
Wiener [111] report that the difference between Montgomery and Barrett reduction was 
negligible in their implementation on an Intel Pentium Pro of field arithmetic in F , for 
a 192-bit prime p. 


Montgomery reduction is due to Montgomery [330]. Koç, Acar, and Kaliski [260]
analyze five Montgomery multiplication algorithms. The methods were identified as
having a separate reduction phase or reduction integrated with multiplication, and
according to the general form of the multiplication as operand-scanning or product-
scanning. Among the algorithms tested, they conclude that a “coarsely integrated
operand scanning” method (where a reduction step follows a multiplication step at each
index of an outer loop through one of the operands) is simplest and probably best for
general-purpose processors. Koç and Acar [259] extend Montgomery multiplication to
binary fields.


The binary gcd algorithm (Algorithm 2.21) is due to Stein [451], and is analyzed by 
Knuth [249]. Bach and Shallit [23] provide a comprehensive analysis of several gcd 
algorithms. The binary algorithm for inversion (Algorithm 2.22) is adapted from the 
corresponding extended gcd algorithm. 


Lehmer [278] proposed a variant of the classical Euclidean algorithm which replaces 
most of the expensive multiple-precision divisions by single-precision operations. The 
algorithm is examined in detail by Knuth [249], and a slight modification is analyzed 
by Sorenson [450]. Durand [126] provides concise coverage of inversion algorithms 
adapted from the extended versions of the Euclidean, binary gcd, and Lehmer algo- 
rithms, along with timings for RSA and elliptic curve point multiplication on 32-bit 
RISC processors (for smartcards) from SGS-Thomson. On these processors, Lehmer’s 
method showed significant advantages, and in fact produced point multiplication times 
faster than was obtained with projective coordinates. 


Algorithm 2.23 for the partial Montgomery inverse is due to Kaliski [234]. De Win, 
Mister, Preneel and Wiener [111] report that an inversion method based on this algo- 
rithm was superior to variations of the extended Euclidean algorithm (Algorithm 2.19) 
in their tests on an Intel Pentium Pro, although details are not provided. The generaliza-
tion in Algorithm 2.25 is due to Savaş and Koç [403]; a similar algorithm is provided
for finding the usual inverse.


Simultaneous inversion (Algorithm 2.26) is attributed to Montgomery [331], where the 
technique was suggested for accelerating the elliptic curve method (ECM) of factoring. 
Cohen [99, Algorithm 10.3.4] gives an extended version of Algorithm 2.26, presented 
in the context of ECM. 


The NIST primes (§2.2.6) are given in the Federal Information Processing Standards 
(FIPS) publication 186-2 [140] on the Digital Signature Standard, as part of the recom-
mended elliptic curves for US Government use. Solinas [445] discusses generalizations
of Mersenne numbers $2^k - 1$ that permit fast reduction (without division); the NIST
primes are special cases.


§2.3 

Algorithms 2.35 and 2.36 for polynomial multiplication are due to Lopez and Dahab
[301]. Their work expands on “comb” exponentiation methods of Lim and Lee [295].
Operation count comparisons and implementation results (on Intel family and Sun
UltraSPARC processors) suggest that Algorithm 2.36 will be significantly faster than
Algorithm 2.34 at relatively modest storage requirements. The multiple-table variants
in Note 2.37 are essentially described by Lopez and Dahab [301, Remark 2].


The OpenSSL contribution by Sun Microsystems Laboratories mentioned in Note 2.38
is authored by Sheueling Chang Shantz and Douglas Stebila. Our notes are based in
part on OpenSSL-0.9.8 snapshots. A significant enhancement is discussed by Weimerskirch,
Stebila, and Chang Shantz [478]. Appendix C has a few notes on the OpenSSL
library.


The NIST reduction polynomials (§2.3.5) are given in the Federal Information Pro- 
cessing Standards (FIPS) publication 186-2 [140] on the Digital Signature Standard, as 
part of the recommended elliptic curves for US Government use. 


The binary algorithm for inversion (Algorithm 2.49) is the polynomial analogue of
Algorithm 2.22. The almost inverse algorithm (Algorithm 2.50) is due to Schroeppel,
Orman, O’Malley, and Spatscheck [415]; a similar algorithm (Algorithm 2.23) in the
context of Montgomery inversion was described by Kaliski [234].


Algorithms for field division were described by Goodman and Chandrakasan [177], 
Chang Shantz [90], Durand [126], and Schroeppel [412]. Inversion and division algo- 
rithm implementations are especially sensitive to compiler differences and processor 
characteristics, and rough operation count analysis can be misleading. Fong, Hanker- 
son, Lopez and Menezes [144] discuss inversion and division algorithm considerations 
and provide comparative timings for selected compilers on the Intel Pentium II and 
Sun UltraSPARC. 


In a normal basis representation, elements of $\mathbb{F}_{2^m}$ are expressed in terms of a basis
of the form $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$. One advantage of normal bases is that squaring of
a field element is a simple rotation of its vector representation. Mullin, Onyszchuk,
Vanstone and Wilson [337] introduced the concept of an optimal normal basis in or-
der to reduce the hardware complexity of multiplying field elements in $\mathbb{F}_{2^m}$ whose
elements are represented using a normal basis. Hardware implementations of the arith-
metic in $\mathbb{F}_{2^m}$ using optimal normal bases are described by Agnew, Mullin, Onyszchuk
and Vanstone [6] and Sunar and Koç [456].


Normal bases of low complexity, also known as Gaussian normal bases, were further 
studied by Ash, Blake and Vanstone [19]. Gaussian normal bases are explicitly de- 
scribed in the ANSI X9.62 standard [14] for the ECDSA. Experience has shown that 
optimal normal bases do not have any significant advantages over polynomial bases for 
hardware implementation. Moreover, field multiplication in software for normal basis 
representations is very slow in comparison to multiplication with a polynomial basis; 
see Reyhani-Masoleh and Hasan [390] and Ning and Yin [348]. 


§2.4 
Optimal extension fields were introduced by Bailey and Paar [25, 26]. Theorem 2.52 is
from Lidl and Niederreiter [292, Theorem 3.75]. Theorem 2.57 corrects [26, Corollary
2]. The OEF construction algorithm of [26] has a minor flaw in the test for irreducibil-
ity, leading to a few incorrect entries in their table of Type II OEFs (e.g., $z^{25} - 2$ is not
irreducible when $p = 2^8 - 5$). The inversion method of §2.4.3 given by Bailey and Paar
is based on Itoh and Tsujii [217]; see also [183].


Lim and Hwang [293] give thorough coverage to various optimization strategies and 
provide useful benchmark timings on Intel and DEC processors. Their operation count 
analysis favours a Euclidean algorithm variant over Algorithm 2.59 for inversion. How- 
ever, rough operation counts at this level often fail to capture processor or compiler 
characteristics adequately, and in subsequent work [294] they note that Algorithm 2.59 
appears to be significantly faster in implementation on Intel Pentium II and DEC 
Alpha processors. Chung, Sim, and Lee [97] note that the count for the number of 
required Frobenius-map applications in inversion given in [26] is not necessarily min- 
imal. A revised formula is given, along with inversion algorithm comparisons and 
implementation results for a low-power Samsung CalmRISC 8-bit processor with a 
math coprocessor. 
CHAPTER 3


Elliptic Curve Arithmetic 


Cryptographic mechanisms based on elliptic curves depend on arithmetic involving the 
points of the curve. As noted in Chapter 2, curve arithmetic is defined in terms of un- 
derlying field operations, the efficiency of which is essential. Efficient curve operations 
are likewise crucial to performance. 


Figure 3.1 illustrates the module framework required for a protocol such as the El-
liptic Curve Digital Signature Algorithm (ECDSA, discussed in §4.4.1). The curve
arithmetic not only is built on field operations, but in some cases also relies on big
number and modular arithmetic (e.g., $\tau$-adic operations if Koblitz curves are used;
see §3.4). ECDSA uses a hash function and certain modular operations, but the
computationally-expensive steps involve curve operations.
Figure 3.1. ECDSA support modules.

§3.1 provides an introduction to elliptic curves. The group operations of addition
and doubling for the points on an elliptic curve are given, along with fundamental 
structure and other properties. §3.2 presents projective-coordinate representations (and 
associated point addition and doubling algorithms), of principal interest when field 
inversion is expensive relative to field multiplication. §3.3 discusses strategies for point 
multiplication, the operation which dominates the execution time of schemes based on 
elliptic curves. 

The methods in §3.4, §3.5, and §3.6 are related in the sense that they all exploit en- 
domorphisms of the elliptic curve to reduce the cost of doubling in point multiplication. 
§3.4 discusses the special Koblitz curves, which allow point doubling for curves over 
$\mathbb{F}_2$ to be replaced by inexpensive field squaring operations. §3.5 examines a broader
class of elliptic curves which admit endomorphisms that can be used efficiently to re- 
duce the number of doublings in point multiplication. Strategies in §3.6 for elliptic 
curves over binary fields replace most point doublings by a potentially faster halving 
operation. §3.7 contains operation count comparisons for selected point multiplication 
methods. §3.8 concludes with chapter notes and references. 


3.1 Introduction to elliptic curves

Definition 3.1 An elliptic curve $E$ over a field $K$ is defined by an equation

$$E: y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6 \qquad (3.1)$$

where $a_1, a_2, a_3, a_4, a_6 \in K$ and $\Delta \neq 0$, where $\Delta$ is the discriminant of $E$ and is defined
as follows:

$$\begin{aligned}
\Delta &= -d_2^2 d_8 - 8d_4^3 - 27d_6^2 + 9d_2d_4d_6 \\
d_2 &= a_1^2 + 4a_2 \\
d_4 &= 2a_4 + a_1a_3 \qquad\qquad (3.2) \\
d_6 &= a_3^2 + 4a_6 \\
d_8 &= a_1^2a_6 + 4a_2a_6 - a_1a_3a_4 + a_2a_3^2 - a_4^2.
\end{aligned}$$

If $L$ is any extension field of $K$, then the set of $L$-rational points on $E$ is

$$E(L) = \{(x, y) \in L \times L : y^2 + a_1xy + a_3y - x^3 - a_2x^2 - a_4x - a_6 = 0\} \cup \{\infty\}$$

where $\infty$ is the point at infinity.
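
The discriminant formulas (3.2) can be checked numerically. The following is a minimal sketch, assuming integer coefficients that are reduced modulo a prime only at the end; the function names are illustrative and not part of the text.

```python
def weierstrass_discriminant(a1, a2, a3, a4, a6):
    """Discriminant of y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6, following (3.2)."""
    d2 = a1 * a1 + 4 * a2
    d4 = 2 * a4 + a1 * a3
    d6 = a3 * a3 + 4 * a6
    d8 = a1 * a1 * a6 + 4 * a2 * a6 - a1 * a3 * a4 + a2 * a3 * a3 - a4 * a4
    return -d2 * d2 * d8 - 8 * d4 ** 3 - 27 * d6 * d6 + 9 * d2 * d4 * d6


def defines_elliptic_curve_mod_p(coeffs, p):
    """True if the Weierstrass coefficients give a nonzero discriminant modulo the prime p."""
    return weierstrass_discriminant(*coeffs) % p != 0


# For a1 = a2 = a3 = 0 the general formula collapses to -16(4a^3 + 27b^2).
a, b = 4, 20
print(weierstrass_discriminant(0, 0, 0, a, b), -16 * (4 * a**3 + 27 * b**2))
```

Because the discriminant is an integer polynomial in the coefficients, computing it over the integers and reducing afterwards gives the same residue as computing in the prime field directly.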

(a) $E_1: y^2 = x^3 - x$    (b) $E_2: y^2 = x^3 + \frac{1}{4}x + \frac{5}{4}$


Figure 3.2. Elliptic curves over $\mathbb{R}$.


Remark 3.2 (comments on Definition 3.1)
(i) Equation (3.1) is called a Weierstrass equation.
(ii) We say that $E$ is defined over $K$ because the coefficients $a_1, a_2, a_3, a_4, a_6$ of its
defining equation are elements of $K$. We sometimes write $E/K$ to emphasize
that $E$ is defined over $K$, and $K$ is called the underlying field. Note that if $E$ is
defined over $K$, then $E$ is also defined over any extension field of $K$.
(iii) The condition $\Delta \neq 0$ ensures that the elliptic curve is “smooth”, that is, there are
no points at which the curve has two or more distinct tangent lines.
(iv) The point $\infty$ is the only point on the line at infinity that satisfies the projective
form of the Weierstrass equation (see §3.2).
(v) The $L$-rational points on $E$ are the points $(x, y)$ that satisfy the equation of
the curve and whose coordinates $x$ and $y$ belong to $L$. The point at infinity is
considered an $L$-rational point for all extension fields $L$ of $K$.


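Example 3.3 (elliptic curves over $\mathbb{R}$) Consider the elliptic curves

$$\begin{aligned}
E_1 &: y^2 = x^3 - x \\
E_2 &: y^2 = x^3 + \tfrac{1}{4}x + \tfrac{5}{4}
\end{aligned}$$

defined over the field $\mathbb{R}$ of real numbers. The points $E_1(\mathbb{R}) \setminus \{\infty\}$ and $E_2(\mathbb{R}) \setminus \{\infty\}$ are
graphed in Figure 3.2.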
3.1.1 Simplified Weierstrass equations


Definition 3.4 Two elliptic curves $E_1$ and $E_2$ defined over $K$ and given by the
Weierstrass equations

$$\begin{aligned}
E_1 &: y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6 \\
E_2 &: y^2 + \bar{a}_1xy + \bar{a}_3y = x^3 + \bar{a}_2x^2 + \bar{a}_4x + \bar{a}_6
\end{aligned}$$

are said to be isomorphic over $K$ if there exist $u, r, s, t \in K$, $u \neq 0$, such that the change
of variables

$$(x, y) \rightarrow (u^2x + r,\ u^3y + u^2sx + t) \qquad (3.3)$$

transforms equation $E_1$ into equation $E_2$. The transformation (3.3) is called an
admissible change of variables.


A Weierstrass equation

$$E: y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6$$

defined over $K$ can be simplified considerably by applying admissible changes of
variables. The simplified equations will be used throughout the remainder of this book. We
consider separately the cases where the underlying field $K$ has characteristic different
from 2 and 3, or has characteristic equal to 2 or 3.


1. If the characteristic of $K$ is not equal to 2 or 3, then the admissible change of
variables

$$(x, y) \rightarrow \left(\frac{x - 3a_1^2 - 12a_2}{36},\ \frac{y - 3a_1x}{216} - \frac{a_1^3 + 4a_1a_2 - 12a_3}{24}\right)$$

transforms $E$ to the curve

$$y^2 = x^3 + ax + b \qquad (3.4)$$

where $a, b \in K$. The discriminant of this curve is $\Delta = -16(4a^3 + 27b^2)$.


2. If the characteristic of $K$ is 2, then there are two cases to consider. If $a_1 \neq 0$, then
the admissible change of variables

$$(x, y) \rightarrow \left(a_1^2x + \frac{a_3}{a_1},\ a_1^3y + \frac{a_1^2a_4 + a_3^2}{a_1^3}\right)$$

transforms $E$ to the curve

$$y^2 + xy = x^3 + ax^2 + b \qquad (3.5)$$

where $a, b \in K$. Such a curve is said to be non-supersingular (cf. Definition 3.10)
and has discriminant $\Delta = b$. If $a_1 = 0$, then the admissible change of variables

$$(x, y) \rightarrow (x + a_2,\ y)$$

transforms $E$ to the curve

$$y^2 + cy = x^3 + ax + b \qquad (3.6)$$

where $a, b, c \in K$. Such a curve is said to be supersingular (cf. Definition 3.10)
and has discriminant $\Delta = c^4$.


3. If the characteristic of $K$ is 3, then there are two cases to consider. If $a_1^2 \neq -a_2$,
then the admissible change of variables

$$(x, y) \rightarrow \left(x + \frac{d_4}{d_2},\ y + a_1x + a_1\frac{d_4}{d_2} + a_3\right),$$

where $d_2 = a_1^2 + a_2$ and $d_4 = a_4 - a_1a_3$, transforms $E$ to the curve

$$y^2 = x^3 + ax^2 + b \qquad (3.7)$$

where $a, b \in K$. Such a curve is said to be non-supersingular and has
discriminant $\Delta = -a^3b$. If $a_1^2 = -a_2$, then the admissible change of variables

$$(x, y) \rightarrow (x,\ y + a_1x + a_3)$$

transforms $E$ to the curve

$$y^2 = x^3 + ax + b \qquad (3.8)$$

where $a, b \in K$. Such a curve is said to be supersingular and has discriminant
$\Delta = -a^3$.


3.1.2 Group law 


Let E be an elliptic curve defined over the field K. There is a chord-and-tangent rule 
for adding two points in $E(K)$ to give a third point in $E(K)$. Together with this addition
operation, the set of points $E(K)$ forms an abelian group with $\infty$ serving as its identity.
It is this group that is used in the construction of elliptic curve cryptographic systems. 

The addition rule is best explained geometrically. Let P = (x1, y1) and Q = (x2, y2) 
be two distinct points on an elliptic curve E. Then the sum R, of P and Q, is defined 
as follows. First draw a line through P and Q; this line intersects the elliptic curve at 
a third point. Then R is the reflection of this point about the x-axis. This is depicted in 
Figure 3.3(a). 

The double R, of P, is defined as follows. First draw the tangent line to the elliptic 
curve at P. This line intersects the elliptic curve at a second point. Then R is the 
reflection of this point about the x-axis. This is depicted in Figure 3.3(b). 

Algebraic formulas for the group law can be derived from the geometric description.
These formulas are presented next for elliptic curves $E$ of the simplified Weierstrass
form (3.4) in affine coordinates when the characteristic of the underlying field $K$ is not
2 or 3 (e.g., $K = \mathbb{F}_p$ where $p > 3$ is a prime), for non-supersingular elliptic curves $E$ of
the form (3.5) over $K = \mathbb{F}_{2^m}$, and for supersingular elliptic curves $E$ of the form (3.6)
over $K = \mathbb{F}_{2^m}$.

(a) Addition: $P + Q = R$.    (b) Doubling: $P + P = R$.


Figure 3.3. Geometric addition and doubling of elliptic curve points.


Group law for $E/K: y^2 = x^3 + ax + b$, $\mathrm{char}(K) \neq 2, 3$

1. Identity. $P + \infty = \infty + P = P$ for all $P \in E(K)$.

2. Negatives. If $P = (x, y) \in E(K)$, then $(x, y) + (x, -y) = \infty$. The point $(x, -y)$
is denoted by $-P$ and is called the negative of $P$; note that $-P$ is indeed a point
in $E(K)$. Also, $-\infty = \infty$.

3. Point addition. Let $P = (x_1, y_1) \in E(K)$ and $Q = (x_2, y_2) \in E(K)$, where $P \neq \pm Q$.
Then $P + Q = (x_3, y_3)$, where

$$x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2 \quad\text{and}\quad y_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x_1 - x_3) - y_1.$$

4. Point doubling. Let $P = (x_1, y_1) \in E(K)$, where $P \neq -P$. Then $2P = (x_3, y_3)$,
where

$$x_3 = \left(\frac{3x_1^2 + a}{2y_1}\right)^2 - 2x_1 \quad\text{and}\quad y_3 = \left(\frac{3x_1^2 + a}{2y_1}\right)(x_1 - x_3) - y_1.$$








Example 3.5 (elliptic curve over the prime field $\mathbb{F}_{29}$) Let $p = 29$, $a = 4$, and $b = 20$,
and consider the elliptic curve

$$E: y^2 = x^3 + 4x + 20$$

defined over $\mathbb{F}_{29}$. Note that $\Delta = -16(4a^3 + 27b^2) = -176896 \not\equiv 0 \pmod{29}$, so $E$ is
indeed an elliptic curve. The points in $E(\mathbb{F}_{29})$ are the following:

$\infty$      (2,6)    (4,19)   (8,10)   (13,23)  (16,2)   (19,16)  (27,2)
(0,7)    (2,23)   (5,7)    (8,19)   (14,6)   (16,27)  (20,3)   (27,27)
(0,22)   (3,1)    (5,22)   (10,4)   (14,23)  (17,10)  (20,26)
(1,5)    (3,28)   (6,12)   (10,25)  (15,2)   (17,19)  (24,7)
(1,24)   (4,10)   (6,17)   (13,6)   (15,27)  (19,13)  (24,22)

Examples of elliptic curve addition are $(5,22) + (16,27) = (13,6)$, and $2(5,22) = (14,6)$.
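
The affine group law for $y^2 = x^3 + ax + b$ over a prime field can be sketched in a few lines. This is an illustrative implementation, not the book's: it assumes Python 3.8+ (for `pow(x, -1, p)`), represents $\infty$ by `None`, and expects coordinates already reduced modulo $p$. The names `ec_add` and `curve_points` are hypothetical.

```python
INF = None  # stands in for the point at infinity


def ec_add(P, Q, a, p):
    """Add two points of y^2 = x^3 + a*x + b over F_p (p > 3 prime) in affine coordinates."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF                      # Q = -P (includes doubling a point with y = 0)
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p        # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def curve_points(a, b, p):
    """Brute-force list of E(F_p); only sensible for very small p."""
    pts = [INF]
    for x in range(p):
        rhs = (x**3 + a * x + b) % p
        pts.extend((x, y) for y in range(p) if y * y % p == rhs)
    return pts


print(ec_add((5, 22), (16, 27), 4, 29))   # (13, 6)
print(ec_add((5, 22), (5, 22), 4, 29))    # (14, 6)
print(len(curve_points(4, 20, 29)))       # 37
```

The printed values reproduce the addition, the doubling, and the point count stated in Example 3.5.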


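Group law for non-supersingular $E/\mathbb{F}_{2^m}: y^2 + xy = x^3 + ax^2 + b$

1. Identity. $P + \infty = \infty + P = P$ for all $P \in E(\mathbb{F}_{2^m})$.

2. Negatives. If $P = (x, y) \in E(\mathbb{F}_{2^m})$, then $(x, y) + (x, x + y) = \infty$. The point
$(x, x + y)$ is denoted by $-P$ and is called the negative of $P$; note that $-P$ is
indeed a point in $E(\mathbb{F}_{2^m})$. Also, $-\infty = \infty$.

3. Point addition. Let $P = (x_1, y_1) \in E(\mathbb{F}_{2^m})$ and $Q = (x_2, y_2) \in E(\mathbb{F}_{2^m})$, where
$P \neq \pm Q$. Then $P + Q = (x_3, y_3)$, where

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a \quad\text{and}\quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1$$

with $\lambda = (y_1 + y_2)/(x_1 + x_2)$.

4. Point doubling. Let $P = (x_1, y_1) \in E(\mathbb{F}_{2^m})$, where $P \neq -P$. Then $2P = (x_3, y_3)$,
where

$$x_3 = \lambda^2 + \lambda + a = x_1^2 + \frac{b}{x_1^2} \quad\text{and}\quad y_3 = x_1^2 + \lambda x_3 + x_3$$

with $\lambda = x_1 + y_1/x_1$.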


Example 3.6 (non-supersingular elliptic curve over $\mathbb{F}_{2^4}$) Consider the finite field $\mathbb{F}_{2^4}$
as represented by the reduction polynomial $f(z) = z^4 + z + 1$ (cf. Example 2.2). An
element $a_3z^3 + a_2z^2 + a_1z + a_0 \in \mathbb{F}_{2^4}$ is represented by the bit string $(a_3a_2a_1a_0)$ of
length 4; for example, (0101) represents $z^2 + 1$. Let $a = z^3$, $b = z^3 + 1$, and consider
the non-supersingular elliptic curve

$$E: y^2 + xy = x^3 + z^3x^2 + (z^3 + 1)$$

defined over $\mathbb{F}_{2^4}$. The points in $E(\mathbb{F}_{2^4})$ are the following:

$\infty$             (0011, 1100)   (1000, 0001)   (1100, 0000)
(0000, 1011)   (0011, 1111)   (1000, 1001)   (1100, 1100)
(0001, 0000)   (0101, 0000)   (1001, 0110)   (1111, 0100)
(0001, 0001)   (0101, 0101)   (1001, 1111)   (1111, 1011)
(0010, 1101)   (0111, 1011)   (1011, 0010)
(0010, 1111)   (0111, 1100)   (1011, 1001)

Examples of elliptic curve addition are $(0010, 1111) + (1100, 1100) = (0001, 0001)$,
and $2(0010, 1111) = (1011, 0010)$.
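
The binary-field group law can be exercised on this small example. The sketch below is an assumption-laden illustration: field elements are Python integers whose bits are the polynomial coefficients, multiplication is a naive shift-and-add with reduction by $f(z) = z^4 + z + 1$, and inversion uses exponentiation by $2^m - 2$. All function names are hypothetical.

```python
M = 4
F_POLY = 0b10011  # reduction polynomial z^4 + z + 1


def gf_mul(u, v):
    """Multiply two elements of F_{2^4}; addition in the field is XOR."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u >> M:            # degree reached M: reduce modulo f(z)
            u ^= F_POLY
    return r


def gf_inv(u):
    """Inverse via u^(2^M - 2); adequate for a toy field."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, u)
    return r


def ec2_add(P, Q, a):
    """Group law on y^2 + xy = x^3 + a*x^2 + b over F_{2^4}; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and y2 == x1 ^ y1:
        return None                                   # Q = -P
    if P == Q:
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = gf_mul(lam, lam) ^ lam ^ a
        y3 = gf_mul(x1, x1) ^ gf_mul(lam, x3) ^ x3
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
        y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


a, b = 0b1000, 0b1001
P, Q = (0b0010, 0b1111), (0b1100, 0b1100)
print(ec2_add(P, Q, a))   # (1, 1), i.e. (0001, 0001)
print(ec2_add(P, P, a))   # (11, 2), i.e. (1011, 0010)
```

A brute-force count of the pairs $(x, y)$ satisfying the curve equation, plus $\infty$, gives the 22 points listed in Example 3.6.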
Group law for supersingular $E/\mathbb{F}_{2^m}: y^2 + cy = x^3 + ax + b$

1. Identity. $P + \infty = \infty + P = P$ for all $P \in E(\mathbb{F}_{2^m})$.

2. Negatives. If $P = (x, y) \in E(\mathbb{F}_{2^m})$, then $(x, y) + (x, y + c) = \infty$. The point
$(x, y + c)$ is denoted by $-P$ and is called the negative of $P$; note that $-P$ is
indeed a point in $E(\mathbb{F}_{2^m})$. Also, $-\infty = \infty$.

3. Point addition. Let $P = (x_1, y_1) \in E(\mathbb{F}_{2^m})$ and $Q = (x_2, y_2) \in E(\mathbb{F}_{2^m})$, where
$P \neq \pm Q$. Then $P + Q = (x_3, y_3)$, where

$$x_3 = \left(\frac{y_1 + y_2}{x_1 + x_2}\right)^2 + x_1 + x_2 \quad\text{and}\quad y_3 = \left(\frac{y_1 + y_2}{x_1 + x_2}\right)(x_1 + x_3) + y_1 + c.$$

4. Point doubling. Let $P = (x_1, y_1) \in E(\mathbb{F}_{2^m})$, where $P \neq -P$. Then $2P = (x_3, y_3)$,
where

$$x_3 = \left(\frac{x_1^2 + a}{c}\right)^2 \quad\text{and}\quad y_3 = \left(\frac{x_1^2 + a}{c}\right)(x_1 + x_3) + y_1 + c.$$

3.1.3 Group order


Let $E$ be an elliptic curve defined over $\mathbb{F}_q$. The number of points in $E(\mathbb{F}_q)$, denoted
$\#E(\mathbb{F}_q)$, is called the order of $E$ over $\mathbb{F}_q$. Since the Weierstrass equation (3.1) has
at most two solutions for each $x \in \mathbb{F}_q$, we know that $\#E(\mathbb{F}_q) \in [1, 2q + 1]$. Hasse’s
theorem provides tighter bounds for $\#E(\mathbb{F}_q)$.


Theorem 3.7 (Hasse) Let $E$ be an elliptic curve defined over $\mathbb{F}_q$. Then

$$q + 1 - 2\sqrt{q} \le \#E(\mathbb{F}_q) \le q + 1 + 2\sqrt{q}.$$

The interval $[q + 1 - 2\sqrt{q},\ q + 1 + 2\sqrt{q}]$ is called the Hasse interval. An alternate
formulation of Hasse’s theorem is the following: if $E$ is defined over $\mathbb{F}_q$, then $\#E(\mathbb{F}_q) =
q + 1 - t$ where $|t| \le 2\sqrt{q}$; $t$ is called the trace of $E$ over $\mathbb{F}_q$. Since $2\sqrt{q}$ is small relative
to $q$, we have $\#E(\mathbb{F}_q) \approx q$.

The next result determines the possible values for $\#E(\mathbb{F}_q)$ as $E$ ranges over all
elliptic curves defined over $\mathbb{F}_q$.


Theorem 3.8 (admissible orders of elliptic curves) Let $q = p^m$ where $p$ is the
characteristic of $\mathbb{F}_q$. There exists an elliptic curve $E$ defined over $\mathbb{F}_q$ with $\#E(\mathbb{F}_q) = q + 1 - t$
if and only if one of the following conditions holds:

(i) $t \not\equiv 0 \pmod{p}$ and $t^2 \le 4q$.
(ii) $m$ is odd and either (a) $t = 0$; or (b) $t^2 = 2q$ and $p = 2$; or (c) $t^2 = 3q$ and $p = 3$.
(iii) $m$ is even and either (a) $t^2 = 4q$; or (b) $t^2 = q$ and $p \not\equiv 1 \pmod{3}$; or (c) $t = 0$
and $p \not\equiv 1 \pmod{4}$.

A consequence of Theorem 3.8 is that for any prime $p$ and integer $t$ satisfying
$|t| \le 2\sqrt{p}$, there exists an elliptic curve $E$ over $\mathbb{F}_p$ with $\#E(\mathbb{F}_p) = p + 1 - t$. This
is illustrated in Example 3.9.


Example 3.9 (orders of elliptic curves over $\mathbb{F}_{37}$) Let $p = 37$. Table 3.1 lists, for each
integer $n$ in the Hasse interval $[37 + 1 - 2\sqrt{37},\ 37 + 1 + 2\sqrt{37}]$, the coefficients $(a, b)$
of an elliptic curve $E: y^2 = x^3 + ax + b$ defined over $\mathbb{F}_{37}$ with $\#E(\mathbb{F}_{37}) = n$.


Table 3.1. The admissible orders $n = \#E(\mathbb{F}_{37})$ of elliptic curves $E: y^2 = x^3 + ax + b$ defined
over $\mathbb{F}_{37}$.
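
The claim that every integer in the Hasse interval occurs as a group order over $\mathbb{F}_{37}$ can be confirmed by exhaustive search. The sketch below is a brute-force illustration with hypothetical helper names; it counts points naively and is only practical for tiny primes. The particular $(a, b)$ it reports for each order is simply the first one found and need not match the entries of Table 3.1.

```python
def curve_order(a, b, p):
    """#E(F_p) for y^2 = x^3 + a*x + b, counting the point at infinity."""
    roots = {}
    for y in range(p):
        roots[y * y % p] = roots.get(y * y % p, 0) + 1   # number of square roots of each value
    return 1 + sum(roots.get((x**3 + a * x + b) % p, 0) for x in range(p))


def admissible_orders(p):
    """Map each occurring order n to one pair (a, b) with #E(F_p) = n."""
    found = {}
    for a in range(p):
        for b in range(p):
            if (4 * a**3 + 27 * b * b) % p == 0:
                continue                                  # singular: not an elliptic curve
            found.setdefault(curve_order(a, b, p), (a, b))
    return found


orders = admissible_orders(37)
print(min(orders), max(orders), len(orders))   # 26 50 25
```

Since $2\sqrt{37} \approx 12.17$, the Hasse interval contains exactly the integers 26 through 50, and the search finds a curve for each of them.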


The order $\#E(\mathbb{F}_q)$ can be used to define supersingularity of an elliptic curve.


Definition 3.10 Let $p$ be the characteristic of $\mathbb{F}_q$. An elliptic curve $E$ defined over $\mathbb{F}_q$
is supersingular if $p$ divides $t$, where $t$ is the trace. If $p$ does not divide $t$, then $E$ is
non-supersingular.


If $E$ is an elliptic curve defined over $\mathbb{F}_q$, then $E$ is also defined over any extension
$\mathbb{F}_{q^n}$ of $\mathbb{F}_q$. The group $E(\mathbb{F}_q)$ of $\mathbb{F}_q$-rational points is a subgroup of the group $E(\mathbb{F}_{q^n})$
of $\mathbb{F}_{q^n}$-rational points and hence $\#E(\mathbb{F}_q)$ divides $\#E(\mathbb{F}_{q^n})$. If $\#E(\mathbb{F}_q)$ is known, then
$\#E(\mathbb{F}_{q^n})$ can be efficiently determined by the following result.


Theorem 3.11 Let $E$ be an elliptic curve defined over $\mathbb{F}_q$, and let $\#E(\mathbb{F}_q) = q + 1 - t$.
Then $\#E(\mathbb{F}_{q^n}) = q^n + 1 - V_n$ for all $n \ge 2$, where $\{V_n\}$ is the sequence defined
recursively by $V_0 = 2$, $V_1 = t$, and $V_n = V_1V_{n-1} - qV_{n-2}$ for $n \ge 2$.
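
The recurrence of Theorem 3.11 is straightforward to evaluate. A minimal sketch follows; the function name is hypothetical, and the sample values use the curve of Example 3.5, for which $q = 29$ and $\#E(\mathbb{F}_{29}) = 37$, so $t = -7$.

```python
def order_over_extension(q, t, n):
    """#E(F_{q^n}) from #E(F_q) = q + 1 - t, using V_n = t*V_{n-1} - q*V_{n-2}."""
    v_prev, v_cur = 2, t                  # V_0 and V_1
    for _ in range(n - 1):
        v_prev, v_cur = v_cur, t * v_cur - q * v_prev
    return q**n + 1 - v_cur


q, t = 29, 29 + 1 - 37
for n in (1, 2, 3):
    order = order_over_extension(q, t, n)
    print(n, order, order % 37 == 0)
```

Each computed order is a multiple of 37, consistent with $E(\mathbb{F}_{29})$ being a subgroup of $E(\mathbb{F}_{29^n})$.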


3.1.4 Group structure 


Theorem 3.12 describes the group structure of $E(\mathbb{F}_q)$. We use $\mathbb{Z}_n$ to denote a cyclic
group of order $n$.


Theorem 3.12 (group structure of an elliptic curve) Let $E$ be an elliptic curve defined
over $\mathbb{F}_q$. Then $E(\mathbb{F}_q)$ is isomorphic to $\mathbb{Z}_{n_1} \oplus \mathbb{Z}_{n_2}$ where $n_1$ and $n_2$ are uniquely
determined positive integers such that $n_2$ divides both $n_1$ and $q - 1$.

Note that $\#E(\mathbb{F}_q) = n_1n_2$. If $n_2 = 1$, then $E(\mathbb{F}_q)$ is a cyclic group. If $n_2 > 1$, then
$E(\mathbb{F}_q)$ is said to have rank 2. If $n_2$ is a small integer (e.g., $n_2 = 2, 3$ or 4), we sometimes
say that $E(\mathbb{F}_q)$ is almost cyclic. Since $n_2$ divides both $n_1$ and $q - 1$, one expects that
$E(\mathbb{F}_q)$ is cyclic or almost cyclic for most elliptic curves $E$ over $\mathbb{F}_q$.





Example 3.13 (group structure) The elliptic curve $E: y^2 = x^3 + 4x + 20$ defined over
$\mathbb{F}_{29}$ (cf. Example 3.5) has $\#E(\mathbb{F}_{29}) = 37$. Since 37 is prime, $E(\mathbb{F}_{29})$ is a cyclic group
and any point in $E(\mathbb{F}_{29})$ except for $\infty$ is a generator of $E(\mathbb{F}_{29})$. The following shows
that the multiples of the point $P = (1, 5)$ generate all the points in $E(\mathbb{F}_{29})$.

0P = $\infty$      8P = (8,10)    16P = (0,22)   24P = (16,2)   32P = (6,17)
1P = (1,5)    9P = (14,23)   17P = (27,2)   25P = (19,16)  33P = (15,2)
2P = (4,19)   10P = (13,23)  18P = (2,23)   26P = (10,4)   34P = (20,26)
3P = (20,3)   11P = (10,25)  19P = (2,6)    27P = (13,6)   35P = (4,10)
4P = (15,27)  12P = (19,13)  20P = (27,27)  28P = (14,6)   36P = (1,24)
5P = (6,12)   13P = (16,27)  21P = (0,7)    29P = (8,19)
6P = (17,19)  14P = (5,22)   22P = (3,28)   30P = (24,7)
7P = (24,22)  15P = (3,1)    23P = (5,7)    31P = (17,10)


Example 3.14 (group structure) Consider $\mathbb{F}_{2^4}$ as represented by the reduction
polynomial $f(z) = z^4 + z + 1$. The elliptic curve $E: y^2 + xy = x^3 + z^3x^2 + (z^3 + 1)$ defined
over $\mathbb{F}_{2^4}$ has $\#E(\mathbb{F}_{2^4}) = 22$ (cf. Example 3.6). Since 22 does not have any repeated
factors, $E(\mathbb{F}_{2^4})$ is cyclic. The point $P = (z^3, 1) = (1000, 0001)$ has order 11; its multiples
are shown below.

0P = $\infty$             3P = (1100,0000)   6P = (1011,1001)   9P = (1001,0110)
1P = (1000,0001)   4P = (1111,1011)   7P = (1111,0100)   10P = (1000,1001)
2P = (1001,1111)   5P = (1011,0010)   8P = (1100,1100)


3.1.5 Isomorphism classes 


Recall the definition of isomorphic elliptic curves (Definition 3.4). The relation of
isomorphism is an equivalence relation on the set of elliptic curves defined over a finite
field $K$. If two elliptic curves $E_1$ and $E_2$ are isomorphic over $K$, then their groups
$E_1(K)$ and $E_2(K)$ of $K$-rational points are also isomorphic. However, the converse is
not true (cf. Examples 3.16 and 3.17). We present some results on the isomorphism
classes of elliptic curves defined over finite fields of characteristic not equal to 2 or 3,
and for non-supersingular elliptic curves defined over binary fields.


Theorem 3.15 (isomorphism classes of elliptic curves) Let $K = \mathbb{F}_q$ be a finite field
with $\mathrm{char}(K) \neq 2, 3$.

(i) The elliptic curves

$$\begin{aligned}
E_1 &: y^2 = x^3 + ax + b \qquad (3.9) \\
E_2 &: y^2 = x^3 + \bar{a}x + \bar{b} \qquad (3.10)
\end{aligned}$$

defined over $K$ are isomorphic over $K$ if and only if there exists $u \in K^*$ such that
$u^4\bar{a} = a$ and $u^6\bar{b} = b$. If such a $u$ exists, then the admissible change of variables

$$(x, y) \rightarrow (u^2x,\ u^3y)$$

transforms equation (3.9) into equation (3.10).

(ii) The number of isomorphism classes of elliptic curves over $K$ is $2q + 6$, $2q + 2$,
$2q + 4$, $2q$, for $q \equiv 1, 5, 7, 11 \pmod{12}$ respectively.
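
For small primes, both parts of Theorem 3.15 can be checked by brute force. The sketch below is illustrative only; it assumes $K = \mathbb{F}_p$ with $p > 3$ prime, identifies a curve with its coefficient pair $(a, b)$, and uses hypothetical function names.

```python
def are_isomorphic(curve1, curve2, p):
    """Test the condition u^4 * a_bar = a and u^6 * b_bar = b for some u in F_p^*."""
    (a, b), (a_bar, b_bar) = curve1, curve2
    return any(
        pow(u, 4, p) * a_bar % p == a % p and pow(u, 6, p) * b_bar % p == b % p
        for u in range(1, p)
    )


def isomorphism_classes(p):
    """One canonical representative (the smallest pair) per isomorphism class over F_p."""
    reps = set()
    for a in range(p):
        for b in range(p):
            if (4 * a**3 + 27 * b * b) % p == 0:
                continue                                   # singular equation
            orbit = {(a * pow(u, 4, p) % p, b * pow(u, 6, p) % p) for u in range(1, p)}
            reps.add(min(orbit))
    return reps


for p in (5, 7, 13):
    print(p, len(isomorphism_classes(p)))   # 12, 18, 32
```

The counts agree with $2q + 2$, $2q + 4$ and $2q + 6$ for $q = 5, 7, 13$, which are congruent to 5, 7 and 1 modulo 12.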


Example 3.16 (isomorphism classes of elliptic curves over $\mathbb{F}_5$) Table 3.2 lists the 12
isomorphism classes of elliptic curves over $\mathbb{F}_5$. Note that if the groups $E_1(\mathbb{F}_q)$ and
$E_2(\mathbb{F}_q)$ of $\mathbb{F}_q$-rational points are isomorphic, then this does not imply that the elliptic
curves $E_1$ and $E_2$ are isomorphic over $\mathbb{F}_q$. For example, the elliptic curves $E_1: y^2 =
x^3 + 1$ and $E_2: y^2 = x^3 + 2$ are not isomorphic over $\mathbb{F}_5$, but $E_1(\mathbb{F}_5)$ and $E_2(\mathbb{F}_5)$ both
have order 6 and therefore both groups are isomorphic to $\mathbb{Z}_6$.


Isomorphism class | $\#E(\mathbb{F}_5)$ | Group structure of $E(\mathbb{F}_5)$
$\{y^2 = x^3 + 1,\ y^2 = x^3 + 4\}$ | 6 | $\mathbb{Z}_6$


Table 3.2. Isomorphism classes of elliptic curves $E$ over $\mathbb{F}_5$.


Example 3.17 Let $p = 73$. It is easy to verify using Theorem 3.15 that the elliptic
curves

$$\begin{aligned}
E_1 &: y^2 = x^3 + 25x \\
E_2 &: y^2 = x^3 + 53x + 55
\end{aligned}$$

defined over $\mathbb{F}_p$ are not isomorphic over $\mathbb{F}_p$. However, the groups $E_1(\mathbb{F}_{p^m})$ and
$E_2(\mathbb{F}_{p^m})$ of $\mathbb{F}_{p^m}$-rational points are isomorphic for every $m \ge 1$.


Theorem 3.18 (isomorphism classes of elliptic curves over a binary field) Let $K =
\mathbb{F}_{2^m}$ be a binary field.

(i) The non-supersingular elliptic curves

$$\begin{aligned}
E_1 &: y^2 + xy = x^3 + ax^2 + b \qquad (3.11) \\
E_2 &: y^2 + xy = x^3 + \bar{a}x^2 + \bar{b} \qquad (3.12)
\end{aligned}$$

defined over $K$ are isomorphic over $K$ if and only if $b = \bar{b}$ and $\mathrm{Tr}(a) = \mathrm{Tr}(\bar{a})$,
where Tr is the trace function (see Definition 3.78). If these conditions are
satisfied, then there exists $s \in \mathbb{F}_{2^m}$ such that $\bar{a} = s^2 + s + a$, and the admissible change
of variables

$$(x, y) \rightarrow (x,\ y + sx)$$

transforms equation (3.11) into equation (3.12).

(ii) The number of isomorphism classes of non-supersingular elliptic curves over
$K$ is $2^{m+1} - 2$. Let $\gamma \in \mathbb{F}_{2^m}$ satisfy $\mathrm{Tr}(\gamma) = 1$. A set of representatives of the
isomorphism classes is

$$\{y^2 + xy = x^3 + ax^2 + b \mid a \in \{0, \gamma\},\ b \in \mathbb{F}_{2^m}^*\}.$$

(iii) The order $\#E(\mathbb{F}_{2^m})$ of the non-supersingular elliptic curve $E: y^2 + xy = x^3 +
\gamma x^2 + b$ is divisible by 2. If $\mathrm{Tr}(\gamma) = 0$, then $\#E(\mathbb{F}_{2^m})$ is divisible by 4.


3.2 Point representation and the group law 


Formulas for adding two elliptic points were presented in §3.1 for the elliptic curves
$y^2 = x^3 + ax + b$ defined over a field $K$ of characteristic that is neither 2 nor 3, and for
$y^2 + xy = x^3 + ax^2 + b$ defined over a binary field $K$. For both curves, the formulas
for point addition (i.e., adding two distinct finite points that are not negatives of each
other) and point doubling require a field inversion and several field multiplications.
If inversion in $K$ is significantly more expensive than multiplication, then it may be
advantageous to represent points using projective coordinates.


3.2.1 Projective coordinates 


Let $K$ be a field, and let $c$ and $d$ be positive integers. One can define an equivalence
relation $\sim$ on the set $K^3 \setminus \{(0, 0, 0)\}$ of nonzero triples over $K$ by

$$(X_1, Y_1, Z_1) \sim (X_2, Y_2, Z_2) \text{ if } X_1 = \lambda^cX_2,\ Y_1 = \lambda^dY_2,\ Z_1 = \lambda Z_2 \text{ for some } \lambda \in K^*.$$

The equivalence class containing $(X, Y, Z) \in K^3 \setminus \{(0, 0, 0)\}$ is

$$(X : Y : Z) = \{(\lambda^cX, \lambda^dY, \lambda Z) : \lambda \in K^*\}.$$

$(X : Y : Z)$ is called a projective point, and $(X, Y, Z)$ is called a representative of
$(X : Y : Z)$. The set of all projective points is denoted by $\mathbb{P}(K)$. Notice that if $(X', Y', Z') \in
(X : Y : Z)$ then $(X' : Y' : Z') = (X : Y : Z)$; that is, any element of an equivalence
class can serve as its representative. In particular, if $Z \neq 0$, then $(X/Z^c, Y/Z^d, 1)$ is a
representative of the projective point $(X : Y : Z)$, and in fact is the only representative
with $Z$-coordinate equal to 1. Thus we have a 1-1 correspondence between the set of
projective points

$$\mathbb{P}(K)^* = \{(X : Y : Z) : X, Y, Z \in K,\ Z \neq 0\}$$

and the set of affine points

$$\mathbb{A}(K) = \{(x, y) : x, y \in K\}.$$

The set of projective points

$$\mathbb{P}(K)^0 = \{(X : Y : Z) : X, Y, Z \in K,\ Z = 0\}$$

is called the line at infinity since its points do not correspond to any of the affine points.

The projective form of Weierstrass equation (3.1) of an elliptic curve $E$ defined over
$K$ is obtained by replacing $x$ by $X/Z^c$ and $y$ by $Y/Z^d$, and clearing denominators.
Now, if $(X, Y, Z) \in K^3 \setminus \{(0, 0, 0)\}$ satisfies the projective equation then so does any
$(X', Y', Z') \in (X : Y : Z)$. Therefore it makes sense to say that the projective point
$(X : Y : Z)$ lies on $E$. We thus have a 1-1 correspondence between the affine points in
$\mathbb{A}(K)$ that lie on $E$ and the projective points in $\mathbb{P}(K)^*$ that lie on $E$. The projective
points in $\mathbb{P}(K)^0$ which lie on $E$ are the points at infinity on $E$.


Example 3.19 (standard projective coordinates) Let $c = 1$ and $d = 1$. Then the
projective form of the Weierstrass equation

$$E: y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6$$

defined over $K$ is

$$Y^2Z + a_1XYZ + a_3YZ^2 = X^3 + a_2X^2Z + a_4XZ^2 + a_6Z^3.$$

The only point on the line at infinity that also lies on $E$ is $(0 : 1 : 0)$. This projective
point corresponds to the point $\infty$ in Definition 3.1.


Formulas that do not involve field inversions for adding and doubling points in
projective coordinates can be derived by first converting the points to affine coordinates,
then using the formulas from §3.1 to add the affine points, and finally clearing
denominators. Also of use in point multiplication methods (see §3.3) is the addition of two
points in mixed coordinates, where the two points are given in different coordinate
systems.
Example 3.20 (addition formulas using Jacobian coordinates) Let $c = 2$ and $d = 3$.
The projective point $(X : Y : Z)$, $Z \neq 0$, corresponds to the affine point $(X/Z^2, Y/Z^3)$.
The projective form of the Weierstrass equation

$$E: y^2 = x^3 + ax + b$$

defined over $K$ is

$$Y^2 = X^3 + aXZ^4 + bZ^6.$$

The point at infinity $\infty$ corresponds to $(1 : 1 : 0)$, while the negative of $(X : Y : Z)$ is
$(X : -Y : Z)$.


Point doubling. Let $P = (X_1 : Y_1 : Z_1) \in E$, and suppose that $P \neq -P$. Since $P =
(X_1/Z_1^2 : Y_1/Z_1^3 : 1)$, we can use the doubling formula for $E$ in affine coordinates to
compute $2P = (X_3' : Y_3' : 1)$, obtaining

$$X_3' = \left(\frac{3\frac{X_1^2}{Z_1^4} + a}{2\frac{Y_1}{Z_1^3}}\right)^2 - 2\frac{X_1}{Z_1^2} = \frac{(3X_1^2 + aZ_1^4)^2 - 8X_1Y_1^2}{4Y_1^2Z_1^2}$$

and

$$Y_3' = \left(\frac{3\frac{X_1^2}{Z_1^4} + a}{2\frac{Y_1}{Z_1^3}}\right)\left(\frac{X_1}{Z_1^2} - X_3'\right) - \frac{Y_1}{Z_1^3} = \frac{3X_1^2 + aZ_1^4}{2Y_1Z_1}\left(\frac{X_1}{Z_1^2} - X_3'\right) - \frac{Y_1}{Z_1^3}.$$

To eliminate denominators in the expressions for $X_3'$ and $Y_3'$, we set $X_3 = X_3' \cdot Z_3^2$ and
$Y_3 = Y_3' \cdot Z_3^3$ where $Z_3 = 2Y_1Z_1$, and obtain the following formulas for computing
$2P = (X_3 : Y_3 : Z_3)$ in Jacobian coordinates:

$$\begin{aligned}
X_3 &= (3X_1^2 + aZ_1^4)^2 - 8X_1Y_1^2 \\
Y_3 &= (3X_1^2 + aZ_1^4)(4X_1Y_1^2 - X_3) - 8Y_1^4 \qquad (3.13) \\
Z_3 &= 2Y_1Z_1.
\end{aligned}$$

By storing some intermediate elements, $X_3$, $Y_3$ and $Z_3$ can be computed using six field
squarings and four field multiplications as follows:

$$A \leftarrow Y_1^2,\quad B \leftarrow 4X_1 \cdot A,\quad C \leftarrow 8A^2,\quad D \leftarrow 3X_1^2 + a \cdot Z_1^4,$$
$$X_3 \leftarrow D^2 - 2B,\quad Y_3 \leftarrow D \cdot (B - X_3) - C,\quad Z_3 \leftarrow 2Y_1 \cdot Z_1.$$

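Point addition using mixed Jacobian-affine coordinates. Let $P = (X_1 : Y_1 : Z_1) \in E$,
$Z_1 \neq 0$, and $Q = (X_2 : Y_2 : 1)$, and suppose that $P \neq \pm Q$. Since $P = (X_1/Z_1^2 : Y_1/Z_1^3 :
1)$, we can use the addition formula for $E$ in affine coordinates to compute $P + Q =
(X_3' : Y_3' : 1)$, obtaining

$$X_3' = \left(\frac{Y_2 - \frac{Y_1}{Z_1^3}}{X_2 - \frac{X_1}{Z_1^2}}\right)^2 - \frac{X_1}{Z_1^2} - X_2 = \left(\frac{Y_2Z_1^3 - Y_1}{(X_2Z_1^2 - X_1)Z_1}\right)^2 - \frac{X_1}{Z_1^2} - X_2$$

and

$$Y_3' = \left(\frac{Y_2 - \frac{Y_1}{Z_1^3}}{X_2 - \frac{X_1}{Z_1^2}}\right)\left(\frac{X_1}{Z_1^2} - X_3'\right) - \frac{Y_1}{Z_1^3} = \left(\frac{Y_2Z_1^3 - Y_1}{(X_2Z_1^2 - X_1)Z_1}\right)\left(\frac{X_1}{Z_1^2} - X_3'\right) - \frac{Y_1}{Z_1^3}.$$

To eliminate denominators in the expressions for $X_3'$ and $Y_3'$, we set $X_3 = X_3' \cdot Z_3^2$
and $Y_3 = Y_3' \cdot Z_3^3$ where $Z_3 = (X_2Z_1^2 - X_1)Z_1$, and obtain the following formulas for
computing $P + Q = (X_3 : Y_3 : Z_3)$ in Jacobian coordinates:

$$\begin{aligned}
X_3 &= (Y_2Z_1^3 - Y_1)^2 - (X_2Z_1^2 - X_1)^2(X_1 + X_2Z_1^2) \\
Y_3 &= (Y_2Z_1^3 - Y_1)(X_1(X_2Z_1^2 - X_1)^2 - X_3) - Y_1(X_2Z_1^2 - X_1)^3 \qquad (3.14) \\
Z_3 &= (X_2Z_1^2 - X_1)Z_1.
\end{aligned}$$

By storing some intermediate elements, $X_3$, $Y_3$ and $Z_3$ can be computed using three
field squarings and eight field multiplications as follows:

$$A \leftarrow Z_1^2,\quad B \leftarrow Z_1 \cdot A,\quad C \leftarrow X_2 \cdot A,\quad D \leftarrow Y_2 \cdot B,\quad E \leftarrow C - X_1,$$
$$F \leftarrow D - Y_1,\quad G \leftarrow E^2,\quad H \leftarrow G \cdot E,\quad I \leftarrow X_1 \cdot G,$$
$$X_3 \leftarrow F^2 - (H + 2I),\quad Y_3 \leftarrow F \cdot (I - X_3) - Y_1 \cdot H,\quad Z_3 \leftarrow Z_1 \cdot E.$$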
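
The intermediate-value sequences for (3.13) and (3.14) translate directly into code. The following sketch assumes a prime field $\mathbb{F}_p$ and Python 3.8+; it deliberately omits the exceptional cases ($P = \infty$, $P = \pm Q$), and the function names are hypothetical. The sample values use the curve of Example 3.5 ($p = 29$, $a = 4$).

```python
def jacobian_double(P, a, p):
    """2P via the A, B, C, D sequence for (3.13); P = (X1, Y1, Z1), P != -P assumed."""
    X1, Y1, Z1 = P
    A = Y1 * Y1 % p
    B = 4 * X1 * A % p
    C = 8 * A * A % p
    D = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    X3 = (D * D - 2 * B) % p
    Y3 = (D * (B - X3) - C) % p
    Z3 = 2 * Y1 * Z1 % p
    return (X3, Y3, Z3)


def jacobian_add_affine(P, Q, p):
    """P + Q via the A..I sequence for (3.14); P Jacobian, Q = (x2, y2) affine, P != +-Q assumed."""
    X1, Y1, Z1 = P
    x2, y2 = Q
    A = Z1 * Z1 % p
    B = Z1 * A % p
    C = x2 * A % p
    D = y2 * B % p
    E = (C - X1) % p
    F = (D - Y1) % p
    G = E * E % p
    H = G * E % p
    I = X1 * G % p
    X3 = (F * F - (H + 2 * I)) % p
    Y3 = (F * (I - X3) - Y1 * H) % p
    Z3 = Z1 * E % p
    return (X3, Y3, Z3)


def to_affine(P, p):
    """(X : Y : Z) -> (X/Z^2, Y/Z^3); one inversion, Z != 0 assumed."""
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return (X * zi * zi % p, Y * zi**3 % p)


D2 = jacobian_double((5, 22, 1), 4, 29)
print(D2, to_affine(D2, 29))              # (18, 8, 15) (14, 6)
S = jacobian_add_affine(D2, (5, 22), 29)
print(S, to_affine(S, 29))                # (23, 28, 17) (6, 12)
```

The doubling agrees with $2(5,22) = (14,6)$ from Example 3.5, and the sum $(14,6) + (5,22) = (6,12)$ matches the affine chord formula. A single inversion is needed only when converting the final result back to affine coordinates.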


3.2.2 The elliptic curve $y^2 = x^3 + ax + b$


This subsection considers coordinate systems and addition formulas for the elliptic
curve $E: y^2 = x^3 + ax + b$ defined over a field $K$ whose characteristic is neither 2 nor
3. Several types of projective coordinates have been proposed.

1. Standard projective coordinates. Here $c = 1$ and $d = 1$. The projective point
$(X : Y : Z)$, $Z \neq 0$, corresponds to the affine point $(X/Z, Y/Z)$. The projective
equation of the elliptic curve is

$$Y^2Z = X^3 + aXZ^2 + bZ^3.$$

The point at infinity $\infty$ corresponds to $(0 : 1 : 0)$, while the negative of $(X : Y : Z)$
is $(X : -Y : Z)$.

2. Jacobian projective coordinates. Here $c = 2$ and $d = 3$. The projective point
$(X : Y : Z)$, $Z \neq 0$, corresponds to the affine point $(X/Z^2, Y/Z^3)$. The projective
equation of the elliptic curve is

$$Y^2 = X^3 + aXZ^4 + bZ^6.$$

The point at infinity $\infty$ corresponds to $(1 : 1 : 0)$, while the negative of $(X : Y : Z)$
is $(X : -Y : Z)$. Doubling and addition formulas were derived in Example 3.20.
If $a = -3$, the expression $3X_1^2 + aZ_1^4$ that occurs in the doubling formula (3.13)
can be computed using only one field multiplication and one field squaring since

$$3X_1^2 - 3Z_1^4 = 3(X_1 - Z_1^2) \cdot (X_1 + Z_1^2).$$

Henceforth, we shall assume that the elliptic curve $y^2 = x^3 + ax + b$ has $a = -3$.
Theorem 3.15 confirms that the selection is without much loss of generality.
Point doubling can be further accelerated by using the fact that $2Y_1$ appears
several times in (3.13) and trading multiplications by 4 and 8 for divisions by 2. The
revised doubling formulas are:

$$A \leftarrow 3(X_1 - Z_1^2) \cdot (X_1 + Z_1^2),\quad B \leftarrow 2Y_1,\quad Z_3 \leftarrow B \cdot Z_1,\quad C \leftarrow B^2,$$
$$D \leftarrow C \cdot X_1,\quad X_3 \leftarrow A^2 - 2D,\quad Y_3 \leftarrow (D - X_3) \cdot A - C^2/2.$$

The point doubling and point addition procedures for the case $a = -3$ are given
in Algorithms 3.21 and 3.22 where an effort was made to minimize the number
of temporary variables $T_i$. The algorithms are written in terms of basic field
operations; however, specialized routines consisting of integrated basic operations
may be advantageous (see §5.1.2 for a concrete example when floating-point
hardware is used).

3. Chudnovsky coordinates. Here the Jacobian point $(X : Y : Z)$ is represented
as $(X : Y : Z : Z^2 : Z^3)$. The redundancy in this representation is beneficial in
some point multiplication methods where additions are performed in projective
coordinates.


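Algorithm 3.21 Point doubling ($y^2 = x^3 - 3x + b$, Jacobian coordinates)

INPUT: $P = (X_1 : Y_1 : Z_1)$ in Jacobian coordinates on $E/K: y^2 = x^3 - 3x + b$.
OUTPUT: $2P = (X_3 : Y_3 : Z_3)$ in Jacobian coordinates.
1. If $P = \infty$ then return($\infty$).
2. $T_1 \leftarrow Z_1^2$. {$T_1 \leftarrow Z_1^2$}
3. $T_2 \leftarrow X_1 - T_1$. {$T_2 \leftarrow X_1 - Z_1^2$}
4. $T_1 \leftarrow X_1 + T_1$. {$T_1 \leftarrow X_1 + Z_1^2$}
5. $T_2 \leftarrow T_2 \cdot T_1$. {$T_2 \leftarrow X_1^2 - Z_1^4$}
6. $T_2 \leftarrow 3T_2$. {$T_2 \leftarrow A = 3(X_1 - Z_1^2)(X_1 + Z_1^2)$}
7. $Y_3 \leftarrow 2Y_1$. {$Y_3 \leftarrow B = 2Y_1$}
8. $Z_3 \leftarrow Y_3 \cdot Z_1$. {$Z_3 \leftarrow BZ_1$}
9. $Y_3 \leftarrow Y_3^2$. {$Y_3 \leftarrow C = B^2$}
10. $T_3 \leftarrow Y_3 \cdot X_1$. {$T_3 \leftarrow D = CX_1$}
11. $Y_3 \leftarrow Y_3^2$. {$Y_3 \leftarrow C^2$}
12. $Y_3 \leftarrow Y_3/2$. {$Y_3 \leftarrow C^2/2$}
13. $X_3 \leftarrow T_2^2$. {$X_3 \leftarrow A^2$}
14. $T_1 \leftarrow 2T_3$. {$T_1 \leftarrow 2D$}
15. $X_3 \leftarrow X_3 - T_1$. {$X_3 \leftarrow A^2 - 2D$}
16. $T_1 \leftarrow T_3 - X_3$. {$T_1 \leftarrow D - X_3$}
17. $T_1 \leftarrow T_1 \cdot T_2$. {$T_1 \leftarrow (D - X_3)A$}
18. $Y_3 \leftarrow T_1 - Y_3$. {$Y_3 \leftarrow (D - X_3)A - C^2/2$}
19. Return($X_3 : Y_3 : Z_3$).


Algorithm 3.22 Point addition ($y^2 = x^3 - 3x + b$, affine-Jacobian coordinates)

INPUT: $P = (X_1 : Y_1 : Z_1)$ in Jacobian coordinates, $Q = (x_2, y_2)$ in affine coordinates
on $E/K: y^2 = x^3 - 3x + b$.
OUTPUT: $P + Q = (X_3 : Y_3 : Z_3)$ in Jacobian coordinates.
1. If $Q = \infty$ then return($X_1 : Y_1 : Z_1$).
2. If $P = \infty$ then return($x_2 : y_2 : 1$).
3. $T_1 \leftarrow Z_1^2$. {$T_1 \leftarrow A = Z_1^2$}
4. $T_2 \leftarrow T_1 \cdot Z_1$. {$T_2 \leftarrow B = Z_1A$}
5. $T_1 \leftarrow T_1 \cdot x_2$. {$T_1 \leftarrow C = X_2A$}
6. $T_2 \leftarrow T_2 \cdot y_2$. {$T_2 \leftarrow D = Y_2B$}
7. $T_1 \leftarrow T_1 - X_1$. {$T_1 \leftarrow E = C - X_1$}
8. $T_2 \leftarrow T_2 - Y_1$. {$T_2 \leftarrow F = D - Y_1$}
9. If $T_1 = 0$ then
   9.1 If $T_2 = 0$ then use Algorithm 3.21 to compute
       $(X_3 : Y_3 : Z_3) = 2(x_2 : y_2 : 1)$ and return($X_3 : Y_3 : Z_3$).
   9.2 Else return($\infty$).
10. $Z_3 \leftarrow Z_1 \cdot T_1$. {$Z_3 \leftarrow Z_1E$}
11. $T_3 \leftarrow T_1^2$. {$T_3 \leftarrow G = E^2$}
12. $T_4 \leftarrow T_3 \cdot T_1$. {$T_4 \leftarrow H = E^3$}
13. $T_3 \leftarrow T_3 \cdot X_1$. {$T_3 \leftarrow I = X_1G$}
14. $T_1 \leftarrow 2T_3$. {$T_1 \leftarrow 2I$}
15. $X_3 \leftarrow T_2^2$. {$X_3 \leftarrow F^2$}
16. $X_3 \leftarrow X_3 - T_1$. {$X_3 \leftarrow F^2 - 2I$}
17. $X_3 \leftarrow X_3 - T_4$. {$X_3 \leftarrow F^2 - (H + 2I)$}
18. $T_3 \leftarrow T_3 - X_3$. {$T_3 \leftarrow I - X_3$}
19. $T_3 \leftarrow T_3 \cdot T_2$. {$T_3 \leftarrow F(I - X_3)$}
20. $T_4 \leftarrow T_4 \cdot Y_1$. {$T_4 \leftarrow Y_1H$}
21. $Y_3 \leftarrow T_3 - T_4$. {$Y_3 \leftarrow F(I - X_3) - Y_1H$}
22. Return($X_3 : Y_3 : Z_3$).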
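
A step-by-step transcription of Algorithms 3.21 and 3.22 shows how the temporaries are reused. This is a sketch under stated assumptions: the field is $\mathbb{F}_{29}$, the test curve $y^2 = x^3 - 3x + 6$ and the point $(1, 2)$ are chosen here for illustration and do not come from the text, $\infty$ is encoded as $(1, 1, 0)$ in Jacobian form and as `None` for the affine input, and division by 2 is done by multiplying with $2^{-1} \bmod p$.

```python
P_MOD = 29            # illustrative prime; curve y^2 = x^3 - 3x + 6 (assumed example)
INF = (1, 1, 0)       # Jacobian encoding of the point at infinity


def double_a_minus3(P, p=P_MOD):
    """Algorithm 3.21, one line per step."""
    X1, Y1, Z1 = P
    if Z1 == 0:
        return INF
    T1 = Z1 * Z1 % p
    T2 = (X1 - T1) % p
    T1 = (X1 + T1) % p
    T2 = T2 * T1 % p
    T2 = 3 * T2 % p               # A
    Y3 = 2 * Y1 % p               # B
    Z3 = Y3 * Z1 % p
    Y3 = Y3 * Y3 % p              # C
    T3 = Y3 * X1 % p              # D
    Y3 = Y3 * Y3 % p              # C^2
    Y3 = Y3 * pow(2, -1, p) % p   # C^2 / 2
    X3 = T2 * T2 % p
    T1 = 2 * T3 % p
    X3 = (X3 - T1) % p
    T1 = (T3 - X3) % p
    T1 = T1 * T2 % p
    Y3 = (T1 - Y3) % p
    return (X3, Y3, Z3)


def add_mixed_a_minus3(P, Q, p=P_MOD):
    """Algorithm 3.22: P Jacobian, Q affine (or None for infinity)."""
    if Q is None:
        return P
    X1, Y1, Z1 = P
    x2, y2 = Q
    if Z1 == 0:
        return (x2, y2, 1)
    T1 = Z1 * Z1 % p              # A
    T2 = T1 * Z1 % p              # B
    T1 = T1 * x2 % p              # C
    T2 = T2 * y2 % p              # D
    T1 = (T1 - X1) % p            # E
    T2 = (T2 - Y1) % p            # F
    if T1 == 0:
        return double_a_minus3((x2, y2, 1), p) if T2 == 0 else INF
    Z3 = Z1 * T1 % p
    T3 = T1 * T1 % p              # G
    T4 = T3 * T1 % p              # H
    T3 = T3 * X1 % p              # I
    T1 = 2 * T3 % p
    X3 = T2 * T2 % p
    X3 = (X3 - T1) % p
    X3 = (X3 - T4) % p
    T3 = (T3 - X3) % p
    T3 = T3 * T2 % p
    T4 = T4 * Y1 % p
    Y3 = (T3 - T4) % p
    return (X3, Y3, Z3)


twoP = double_a_minus3((1, 2, 1))
print(twoP)                                  # (26, 17, 4), affine (27, 27)
print(add_mixed_a_minus3(twoP, (1, 2)))      # (1, 3, 18), affine (6, 1)
```

In a real implementation each `% p` would be a call into the field arithmetic layer of Chapter 2; the structure of the temporaries is the point of the sketch.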


The field operation counts for point addition and doubling in various coordinate
systems are listed in Table 3.3. The notation $C_1 + C_2 \rightarrow C_3$ means that the points
to be added are in $C_1$ coordinates and $C_2$ coordinates, while their sum is expressed
in $C_3$ coordinates; for example, $J + A \rightarrow J$ is an addition of points in Jacobian and
affine coordinates, with result in Jacobian coordinates. We see that Jacobian
coordinates yield the fastest point doubling, while mixed Jacobian-affine coordinates yield
the fastest point addition. Also useful in some point multiplication algorithms (see
Note 3.43) are mixed Jacobian-Chudnovsky coordinates and mixed Chudnovsky-affine
coordinates for point addition.


Doubling | General addition | Mixed coordinates


Table 3.3. Operation counts for point addition and doubling on $y^2 = x^3 - 3x + b$. A = affine,
P = standard projective, J = Jacobian, C = Chudnovsky, I = inversion, M = multiplication,
S = squaring.


Repeated doublings


If consecutive point doublings are to be performed, then Algorithm 3.23 may be slightly
faster than repeated use of the doubling formula. By working with $2Y$ until the final
step, only one division by 2 is required. A field addition in the loop is eliminated by
calculating $3(X - Z^2)(X + Z^2)$ as $3(X^2 - W)$, where $W = Z^4$ is computed at the first
doubling and then updated according to $W \leftarrow WY^4$ before each subsequent doubling.

Algorithm 3.23 Repeated point doubling ($y^2 = x^3 - 3x + b$, Jacobian coordinates)

INPUT: $P = (X : Y : Z)$ in Jacobian coordinates on $E/K: y^2 = x^3 - 3x + b$, and an
integer $m > 0$.
OUTPUT: $2^mP$ in Jacobian coordinates.
1. If $P = \infty$ then return($P$).
2. $Y \leftarrow 2Y$, $W \leftarrow Z^4$.
3. While $m > 0$ do:
   3.1 $A \leftarrow 3(X^2 - W)$, $B \leftarrow XY^2$.
   3.2 $X \leftarrow A^2 - 2B$, $Z \leftarrow ZY$.
   3.3 $m \leftarrow m - 1$. If $m > 0$ then $W \leftarrow WY^4$.
   3.4 $Y \leftarrow 2A(B - X) - Y^4$.
4. Return($X, Y/2, Z$).


In $m$ consecutive doublings, Algorithm 3.23 trades $m - 1$ field additions, $m - 1$
divisions by two, and a multiplication for two field squarings (in comparison with repeated
applications of Algorithm 3.21). The strategy can be adapted to the case where $a \neq -3$,
saving two field squarings in each of $m - 1$ doublings.
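
Algorithm 3.23 can be sketched in the same style. The example again assumes the illustrative curve $y^2 = x^3 - 3x + 6$ over $\mathbb{F}_{29}$ with the point $(1, 2)$, neither of which comes from the text; the function names are hypothetical and Python 3.8+ is assumed for the modular inverse.

```python
def repeated_double(P, m, p):
    """2^m * P on y^2 = x^3 - 3x + b in Jacobian coordinates, following Algorithm 3.23."""
    X, Y, Z = P
    if Z == 0:
        return P                              # P is the point at infinity
    Y = 2 * Y % p                             # work with 2Y throughout the loop
    W = pow(Z, 4, p)
    while m > 0:
        A = 3 * (X * X - W) % p
        B = X * Y * Y % p
        X = (A * A - 2 * B) % p
        Z = Z * Y % p
        m -= 1
        if m > 0:
            W = W * pow(Y, 4, p) % p          # uses Y before it is updated below
        Y = (2 * A * (B - X) - pow(Y, 4, p)) % p
    return (X, Y * pow(2, -1, p) % p, Z)      # single division by 2 at the end


def to_affine(P, p):
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return (X * zi * zi % p, Y * zi**3 % p)


for m in (1, 2, 3):
    R = repeated_double((1, 2, 1), m, 29)
    print(m, R, to_affine(R, 29))
```

For this point the affine results are $(27, 27)$, $(0, 21)$ and $(4, 0)$; the last has $y = 0$, so one further doubling would give $\infty$ (the returned $Z$ becomes 0).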


3.2.3 The elliptic curve $y^2 + xy = x^3 + ax^2 + b$


This subsection considers coordinate systems and addition formulas for the
non-supersingular elliptic curve $E: y^2 + xy = x^3 + ax^2 + b$ defined over a binary field
$K$. Several types of projective coordinates have been proposed.

1. Standard projective coordinates. Here $c = 1$ and $d = 1$. The projective point
$(X : Y : Z)$, $Z \neq 0$, corresponds to the affine point $(X/Z, Y/Z)$. The projective
equation of the elliptic curve is

$$Y^2Z + XYZ = X^3 + aX^2Z + bZ^3.$$

The point at infinity $\infty$ corresponds to $(0 : 1 : 0)$, while the negative of $(X : Y : Z)$
is $(X : X + Y : Z)$.

2. Jacobian projective coordinates. Here $c = 2$ and $d = 3$. The projective point
$(X : Y : Z)$, $Z \neq 0$, corresponds to the affine point $(X/Z^2, Y/Z^3)$. The projective
equation of the elliptic curve is

$$Y^2 + XYZ = X^3 + aX^2Z^2 + bZ^6.$$

The point at infinity $\infty$ corresponds to $(1 : 1 : 0)$, while the negative of $(X : Y : Z)$
is $(X : XZ + Y : Z)$.

3. López-Dahab (LD) projective coordinates. Here $c = 1$ and $d = 2$. The
projective point $(X : Y : Z)$, $Z \neq 0$, corresponds to the affine point $(X/Z, Y/Z^2)$. The
projective equation of the elliptic curve is

$$Y^2 + XYZ = X^3Z + aX^2Z^2 + bZ^4.$$

The point at infinity $\infty$ corresponds to $(1 : 0 : 0)$, while the negative of $(X : Y : Z)$
is $(X : XZ + Y : Z)$. Formulas for computing the double $(X_3 : Y_3 : Z_3)$ of $(X_1 :
Y_1 : Z_1)$ are

$$Z_3 \leftarrow X_1^2 \cdot Z_1^2,\quad X_3 \leftarrow X_1^4 + b \cdot Z_1^4,\quad Y_3 \leftarrow bZ_1^4 \cdot Z_3 + X_3 \cdot (aZ_3 + Y_1^2 + bZ_1^4).$$

Formulas for computing the sum $(X_3 : Y_3 : Z_3)$ of $(X_1 : Y_1 : Z_1)$ and $(X_2 : Y_2 : 1)$
are

$$A \leftarrow Y_2 \cdot Z_1^2 + Y_1,\quad B \leftarrow X_2 \cdot Z_1 + X_1,\quad C \leftarrow Z_1 \cdot B,\quad D \leftarrow B^2 \cdot (C + aZ_1^2),$$
$$Z_3 \leftarrow C^2,\quad E \leftarrow A \cdot C,\quad X_3 \leftarrow A^2 + D + E,\quad F \leftarrow X_3 + X_2 \cdot Z_3,$$
$$G \leftarrow (X_2 + Y_2) \cdot Z_3^2,\quad Y_3 \leftarrow (E + Z_3) \cdot F + G.$$

The point doubling and point addition procedures when $a \in \{0, 1\}$ are given in
Algorithms 3.24 and 3.25 where an effort was made to minimize the number of
temporary variables $T_i$. Theorem 3.18(ii) confirms that the restriction $a \in \{0, 1\}$
is without much loss of generality.





Algorithm 3.24 Point doubling (y^2 + xy = x^3 + ax^2 + b, a ∈ {0, 1}, LD coordinates)

INPUT: P = (X_1 : Y_1 : Z_1) in LD coordinates on E/K : y^2 + xy = x^3 + ax^2 + b.
OUTPUT: 2P = (X_3 : Y_3 : Z_3) in LD coordinates.
1. If P = ∞ then return(∞).
2. T_1 ← Z_1^2.                       {T_1 ← Z_1^2}
3. T_2 ← X_1^2.                       {T_2 ← X_1^2}
4. Z_3 ← T_1 · T_2.                   {Z_3 ← X_1^2 Z_1^2}
5. X_3 ← T_2^2.                       {X_3 ← X_1^4}
6. T_1 ← T_1^2.                       {T_1 ← Z_1^4}
7. T_2 ← T_1 · b.                     {T_2 ← bZ_1^4}
8. X_3 ← X_3 + T_2.                   {X_3 ← X_1^4 + bZ_1^4}
9. T_1 ← Y_1^2.                       {T_1 ← Y_1^2}
10. If a = 1 then T_1 ← T_1 + Z_3.    {T_1 ← aZ_3 + Y_1^2}
11. T_1 ← T_1 + T_2.                  {T_1 ← aZ_3 + Y_1^2 + bZ_1^4}
12. Y_3 ← X_3 · T_1.                  {Y_3 ← X_3(aZ_3 + Y_1^2 + bZ_1^4)}
13. T_1 ← T_2 · Z_3.                  {T_1 ← bZ_1^4 Z_3}
14. Y_3 ← Y_3 + T_1.                  {Y_3 ← bZ_1^4 Z_3 + X_3(aZ_3 + Y_1^2 + bZ_1^4)}
15. Return(X_3 : Y_3 : Z_3).
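The doubling procedure of Algorithm 3.24 can be exercised on a very small binary field. The sketch below is a hedged illustration: the field F_16 with reduction polynomial z^4 + z + 1, the bit-vector encoding of field elements, and all function names are assumptions made for the example, and `affine_double` is a reference implementation of the affine formulas used only to cross-check the result.

```python
M, POLY = 4, 0b10011   # toy field F_16 = F_2[z]/(z^4 + z + 1) (an assumption)

def gf_mul(a, b):
    """Multiply two field elements stored as bit vectors (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # reduce as soon as degree M appears
            a ^= POLY
    return r

def sq(a):
    return gf_mul(a, a)

def gf_inv(a):
    """a^(2^M - 2), valid for a != 0."""
    r = 1
    for _ in range((1 << M) - 2):
        r = gf_mul(r, a)
    return r

def ld_double(X1, Y1, Z1, a, b):
    """Sketch of Algorithm 3.24 for a in {0, 1}; returns (X3, Y3, Z3)."""
    if Z1 == 0:
        return (1, 0, 0)            # point at infinity in LD coordinates
    T1, T2 = sq(Z1), sq(X1)
    Z3 = gf_mul(T1, T2)             # X1^2 Z1^2
    X3 = sq(T2)                     # X1^4
    T1 = sq(T1)                     # Z1^4
    T2 = gf_mul(T1, b)              # b Z1^4
    X3 ^= T2                        # X1^4 + b Z1^4
    T1 = sq(Y1)
    if a == 1:
        T1 ^= Z3
    T1 ^= T2                        # a Z3 + Y1^2 + b Z1^4
    Y3 = gf_mul(X3, T1) ^ gf_mul(T2, Z3)
    return (X3, Y3, Z3)

def ld_to_affine(X, Y, Z):
    """LD (X : Y : Z) -> affine (X/Z, Y/Z^2); None is infinity."""
    if Z == 0:
        return None
    zi = gf_inv(Z)
    return (gf_mul(X, zi), gf_mul(Y, sq(zi)))

def affine_double(P, a):
    """Reference affine doubling on y^2 + xy = x^3 + ax^2 + b."""
    if P is None or P[0] == 0:
        return None
    x, y = P
    lam = x ^ gf_mul(y, gf_inv(x))
    x3 = sq(lam) ^ lam ^ a
    return (x3, sq(x) ^ gf_mul(lam, x3) ^ x3)
```

Field addition is XOR in this encoding, which is why the additions of Algorithm 3.24 appear as `^`.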
Algorithm 3.25 Point addition (y^2 + xy = x^3 + ax^2 + b, a ∈ {0, 1}, LD-affine coordinates)

INPUT: P = (X_1 : Y_1 : Z_1) in LD coordinates, Q = (x_2, y_2) in affine coordinates on
E/K : y^2 + xy = x^3 + ax^2 + b.
OUTPUT: P + Q = (X_3 : Y_3 : Z_3) in LD coordinates.
1. If Q = ∞ then return(P).
2. If P = ∞ then return(x_2 : y_2 : 1).
3. T_1 ← Z_1 · x_2.                   {T_1 ← X_2 Z_1}
4. T_2 ← Z_1^2.                       {T_2 ← Z_1^2}
5. X_3 ← X_1 + T_1.                   {X_3 ← B = X_2 Z_1 + X_1}
6. T_1 ← Z_1 · X_3.                   {T_1 ← C = Z_1 B}
7. T_3 ← T_2 · y_2.                   {T_3 ← Y_2 Z_1^2}
8. Y_3 ← Y_1 + T_3.                   {Y_3 ← A = Y_2 Z_1^2 + Y_1}
9. If X_3 = 0 then
   9.1 If Y_3 = 0 then use Algorithm 3.24 to compute
       (X_3 : Y_3 : Z_3) = 2(x_2 : y_2 : 1) and return(X_3 : Y_3 : Z_3).
   9.2 Else return(∞).
10. Z_3 ← T_1^2.                      {Z_3 ← C^2}
11. T_3 ← T_1 · Y_3.                  {T_3 ← E = AC}
12. If a = 1 then T_1 ← T_1 + T_2.    {T_1 ← C + aZ_1^2}
13. T_2 ← X_3^2.                      {T_2 ← B^2}
14. X_3 ← T_2 · T_1.                  {X_3 ← D = B^2(C + aZ_1^2)}
15. T_2 ← Y_3^2.                      {T_2 ← A^2}
16. X_3 ← X_3 + T_2.                  {X_3 ← A^2 + D}
17. X_3 ← X_3 + T_3.                  {X_3 ← A^2 + D + E}
18. T_2 ← x_2 · Z_3.                  {T_2 ← X_2 Z_3}
19. T_2 ← T_2 + X_3.                  {T_2 ← F = X_3 + X_2 Z_3}
20. T_1 ← Z_3^2.                      {T_1 ← Z_3^2}
21. T_3 ← T_3 + Z_3.                  {T_3 ← E + Z_3}
22. Y_3 ← T_3 · T_2.                  {Y_3 ← (E + Z_3)F}
23. T_2 ← x_2 + y_2.                  {T_2 ← X_2 + Y_2}
24. T_3 ← T_1 · T_2.                  {T_3 ← G = (X_2 + Y_2)Z_3^2}
25. Y_3 ← Y_3 + T_3.                  {Y_3 ← (E + Z_3)F + G}
26. Return(X_3 : Y_3 : Z_3).








The field operation counts for point addition and doubling in various coordinate
systems are listed in Table 3.4.


Coordinate system         General addition   General addition      Doubling
                                             (mixed coordinates)
Affine                    V + M              -                     V + M
Standard projective                          12M                   7M
Jacobian projective                          10M                   5M
López-Dahab projective                       8M                    4M

Table 3.4. Operation counts for point addition and doubling on y^2 + xy = x^3 + ax^2 + b.
M = multiplication, V = division (see §2.3.6).


3.3. Point multiplication


This section considers methods for computing kP, where k is an integer and P is a
point on an elliptic curve E defined over a field F_q. This operation is called point
multiplication or scalar multiplication, and dominates the execution time of elliptic
curve cryptographic schemes (see Chapter 4). The techniques presented do not exploit any
special structure of the curve. Point multiplication methods that take advantage of
efficiently computable endomorphisms on some special curves are considered in §3.4,
§3.5, and §3.6. §3.3.1 covers the case where P is not known a priori. In instances
where P is fixed, for example in ECDSA signature generation (see §4.4.1), point
multiplication algorithms can exploit precomputed data that depends only on P (and not
on k); algorithms of this kind are presented in §3.3.2. Efficient techniques for
computing kP + lQ are considered in §3.3.3. This operation, called multiple point
multiplication, dominates the execution time of some elliptic curve cryptographic
schemes such as ECDSA signature verification (see §4.4.1).

We will assume that #E(F_q) = nh where n is prime and h is small (so n ≈ q), P
and Q have order n, and multipliers such as k are randomly selected integers from
the interval [1, n − 1]. The binary representation of k is denoted
(k_{t−1}, ..., k_2, k_1, k_0)_2, where t ≈ m = ⌈log_2 q⌉.


3.3.1 Unknown point 


Algorithms 3.26 and 3.27 are the additive versions of the basic repeated-square-and- 
multiply methods for exponentiation. Algorithm 3.26 processes the bits of k from right 
to left, while Algorithm 3.27 processes the bits from left to right. 


Algorithm 3.26 Right-to-left binary method for point multiplication
INPUT: k = (k_{t−1}, ..., k_1, k_0)_2, P ∈ E(F_q).
OUTPUT: kP.
1. Q ← ∞.
2. For i from 0 to t − 1 do
   2.1 If k_i = 1 then Q ← Q + P.
   2.2 P ← 2P.
3. Return(Q).


Algorithm 3.27 Left-to-right binary method for point multiplication
INPUT: k = (k_{t−1}, ..., k_1, k_0)_2, P ∈ E(F_q).
OUTPUT: kP.
1. Q ← ∞.
2. For i from t − 1 downto 0 do
   2.1 Q ← 2Q.
   2.2 If k_i = 1 then Q ← Q + P.
3. Return(Q).
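Both binary methods can be tried out on a toy curve. The following is a minimal sketch, assuming the curve y^2 = x^3 + 2x + 3 over F_97 with affine arithmetic; the curve, the constants and the function names are illustrative choices and not part of the algorithms themselves.

```python
P_MOD, A_COEF = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over F_97 (an assumption)
INF = None                # point at infinity

def add(P, Q):
    """Affine point addition (handles doubling and inverse points)."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def mul_right_to_left(k, P):
    """Algorithm 3.26: scan the bits of k from k_0 upwards."""
    Q = INF
    while k > 0:
        if k & 1:
            Q = add(Q, P)
        P = add(P, P)
        k >>= 1
    return Q

def mul_left_to_right(k, P):
    """Algorithm 3.27: scan the bits of k from k_{t-1} downwards."""
    Q = INF
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q
```

With the point G = (3, 6), which lies on this toy curve, both functions return the same result as adding G to itself k times.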


The expected number of ones in the binary representation of k is t/2 ≈ m/2, whence
the expected running time of Algorithm 3.27 is approximately m/2 point additions and
m point doublings, denoted


(m/2)A + mD.    (3.15)


Let M denote a field multiplication, S a field squaring, and I a field inversion. If affine
coordinates (see §3.1.2) are used, then the running time expressed in terms of field
operations is

2.5mS + 3mM + 1.5mI    (3.16)


if F_q has characteristic > 3, and

3mM + 1.5mI    (3.17)


if F_q is a binary field.

Suppose that F_q has characteristic > 3. If mixed coordinates (see §3.2.2) are used,
then Q is stored in Jacobian coordinates, while P is stored in affine coordinates.
Thus the doubling in step 2.1 can be performed using Algorithm 3.21, while the
addition in step 2.2 can be performed using Algorithm 3.22. The field operation count of
Algorithm 3.27 is then


8mM + 5.5mS + (1I + 3M + 1S)    (3.18)


(one inversion, three multiplications and one squaring are required to convert back to
affine coordinates).
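The main-loop terms of the mixed-coordinate count for prime fields can be recovered from per-operation costs. The short derivation below assumes a cost of 4M + 4S for the Jacobian doubling of Algorithm 3.21 and 8M + 3S for the mixed Jacobian-affine addition of Algorithm 3.22; these two figures are taken as given here and should be checked against the operation counts for those algorithms.

```latex
% Assumed per-operation costs:
%   D = 4M + 4S   (Jacobian doubling, a = -3)
%   A = 8M + 3S   (mixed Jacobian-affine addition)
\begin{align*}
\frac{m}{2}A + mD
  &= \frac{m}{2}\,(8M + 3S) + m\,(4M + 4S) \\
  &= (4m + 4m)\,M + (1.5m + 4m)\,S \\
  &= 8mM + 5.5mS .
\end{align*}
```

The remaining term 1I + 3M + 1S is the one-time conversion of the result back to affine coordinates.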

Suppose now that F_q is a binary field. If mixed coordinates (see §3.2.3) are used,
then Q is stored in LD projective coordinates, while P can be stored in affine
coordinates. Thus the doubling in step 2.1 can be performed using Algorithm 3.24, and the
addition in step 2.2 can be performed using Algorithm 3.25. The field operation count
of Algorithm 3.27 is then

8.5mM + (2M + 1I)    (3.19)


(one inversion and two multiplications are required to convert back to affine
coordinates).
Non-adjacent form (NAF)


If P = (x, y) ∈ E(F_q) then −P = (x, x + y) if F_q is a binary field, and −P = (x, −y)
if F_q has characteristic > 3. Thus subtraction of points on an elliptic curve is just as
efficient as addition. This motivates using a signed digit representation
k = Σ_{i=0}^{l−1} k_i 2^i, where k_i ∈ {0, ±1}. A particularly useful signed digit
representation is the non-adjacent form (NAF).


Definition 3.28 A non-adjacent form (NAF) of a positive integer k is an expression
k = Σ_{i=0}^{l−1} k_i 2^i where k_i ∈ {0, ±1}, k_{l−1} ≠ 0, and no two consecutive
digits k_i are nonzero. The length of the NAF is l.


Theorem 3.29 (properties of NAFs) Let k be a positive integer.
(i) k has a unique NAF denoted NAF(k).
(ii) NAF(k) has the fewest nonzero digits of any signed digit representation of k.
(iii) The length of NAF(k) is at most one more than the length of the binary
representation of k.
(iv) If the length of NAF(k) is l, then 2^l/3 < k < 2^{l+1}/3.
(v) The average density of nonzero digits among all NAFs of length l is
approximately 1/3.


NAF(k) can be efficiently computed using Algorithm 3.30. The digits of NAF(k) are
generated by repeatedly dividing k by 2, allowing remainders of 0 or ±1. If k is odd,
then the remainder r ∈ {−1, 1} is chosen so that the quotient (k − r)/2 is even; this
ensures that the next NAF digit is 0.


Algorithm 3.30 Computing the NAF of a positive integer

INPUT: A positive integer k.
OUTPUT: NAF(k).
1. i ← 0.
2. While k ≥ 1 do
   2.1 If k is odd then: k_i ← 2 − (k mod 4), k ← k − k_i;
   2.2 Else: k_i ← 0.
   2.3 k ← k/2, i ← i + 1.
3. Return(k_{i−1}, k_{i−2}, ..., k_1, k_0).
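Algorithm 3.30 translates almost line for line into code. The sketch below is illustrative; the function name `naf` and the choice to return the digits least significant first are assumptions made for convenience.

```python
def naf(k):
    """Return the NAF digits of a positive integer k, least significant first."""
    digits = []
    while k >= 1:
        if k % 2 == 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d               # now k is divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

# Example: 7 = 8 - 1, so the digits (least significant first) are -1, 0, 0, 1.
print(naf(7))   # [-1, 0, 0, 1]
```

Reversing the returned list gives the digits in the order (k_{l−1}, ..., k_1, k_0) used in the text.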


Algorithm 3.31 modifies the left-to-right binary method for point multiplication
(Algorithm 3.27) by using NAF(k) instead of the binary representation of k. It follows
from (iii) and (v) of Theorem 3.29 that the expected running time of Algorithm 3.31 is
approximately


(m/3)A + mD.    (3.20)


Algorithm 3.31 Binary NAF method for point multiplication

INPUT: Positive integer k, P ∈ E(F_q).
OUTPUT: kP.
1. Use Algorithm 3.30 to compute NAF(k) = Σ_{i=0}^{l−1} k_i 2^i.
2. Q ← ∞.
3. For i from l − 1 downto 0 do
   3.1 Q ← 2Q.
   3.2 If k_i = 1 then Q ← Q + P.
   3.3 If k_i = −1 then Q ← Q − P.
4. Return(Q).


Window methods 


If some extra memory is available, the running time of Algorithm 3.31 can be decreased 
by using a window method which processes w digits of k at a time. 


Definition 3.32 Let w ≥ 2 be a positive integer. A width-w NAF of a positive integer k
is an expression k = Σ_{i=0}^{l−1} k_i 2^i where each nonzero coefficient k_i is odd,
|k_i| < 2^{w−1}, k_{l−1} ≠ 0, and at most one of any w consecutive digits is nonzero.
The length of the width-w NAF is l.


Theorem 3.33 (properties of width-w NAFs) Let k be a positive integer.
(i) k has a unique width-w NAF denoted NAF_w(k).
(ii) NAF_2(k) = NAF(k).
(iii) The length of NAF_w(k) is at most one more than the length of the binary
representation of k.
(iv) The average density of nonzero digits among all width-w NAFs of length l is
approximately 1/(w + 1).


Example 3.34 (width-w NAFs) Let k = 1122334455. We denote a negative integer −c
by c̄. The binary representation of k and the width-w NAFs of k for 2 ≤ w ≤ 6 are
(most significant digit first):


(k)_2    = 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1
NAF_2(k) = 1 0 0 0 1 0 1̄ 0 0 1̄ 0 1 0 1̄ 0 1̄ 0 0 0 1̄ 0 0 1̄ 0 0 0 0 1̄ 0 0 1̄
NAF_3(k) = 1 0 0 0 0 0 3 0 0 1̄ 0 0 1 0 0 3 0 0 0 1̄ 0 0 1̄ 0 0 0 0 1̄ 0 0 1̄
NAF_4(k) = 1 0 0 0 0 1 0 0 0 7 0 0 0 0 5 0 0 0 7 0 0 0 7 0 0 0 1̄ 0 0 0 7
NAF_5(k) = 1 0 0 0 0 1̄5̄ 0 0 0 0 9̄ 0 0 0 0 0 11 0 0 0 0 0 0 9̄ 0 0 0 0 0 0 0 9̄
NAF_6(k) = 1 0 0 0 0 0 0 0 0 23 0 0 0 0 0 11 0 0 0 0 0 0 9̄ 0 0 0 0 0 0 0 9̄

NAF_w(k) can be efficiently computed using Algorithm 3.35, where k mods 2^w denotes
the integer u satisfying u ≡ k (mod 2^w) and −2^{w−1} ≤ u < 2^{w−1}. The digits
of NAF_w(k) are obtained by repeatedly dividing k by 2, allowing remainders r in
[−2^{w−1}, 2^{w−1} − 1]. If k is odd and the remainder r = k mods 2^w is chosen, then
(k − r)/2 will be divisible by 2^{w−1}, ensuring that the next w − 1 digits are zero.


Algorithm 3.35 Computing the width-w NAF of a positive integer

INPUT: Window width w, positive integer k.
OUTPUT: NAF_w(k).
1. i ← 0.
2. While k ≥ 1 do
   2.1 If k is odd then: k_i ← k mods 2^w, k ← k − k_i;
   2.2 Else: k_i ← 0.
   2.3 k ← k/2, i ← i + 1.
3. Return(k_{i−1}, k_{i−2}, ..., k_1, k_0).
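A short sketch of Algorithm 3.35 follows. The names `mods` and `wnaf`, and the least-significant-first digit order, are assumptions made for the illustration.

```python
def mods(k, w):
    """k mods 2^w: the residue u of k modulo 2^w with -2^(w-1) <= u < 2^(w-1)."""
    u = k % (1 << w)
    return u - (1 << w) if u >= (1 << (w - 1)) else u

def wnaf(k, w):
    """Return the width-w NAF digits of k > 0, least significant first."""
    digits = []
    while k >= 1:
        if k & 1:
            d = mods(k, w)
            k -= d               # k is now divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

For k = 1122334455 and w = 4, the nonzero digits produced are 7, −1, 7, 7, 5, 7, 1, 1 at positions 0, 4, 8, 12, 16, 21, 25, 30, which is the NAF_4(k) row of Example 3.34 read from right to left.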


Algorithm 3.36 generalizes the binary NAF method (Algorithm 3.31) by using
NAF_w(k) instead of NAF(k). It follows from (iii) and (iv) of Theorem 3.33 that the
expected running time of Algorithm 3.36 is approximately


[1D + (2^{w−2} − 1)A] + [(m/(w + 1))A + mD].    (3.21)


Algorithm 3.36 Window NAF method for point multiplication

INPUT: Window width w, positive integer k, P ∈ E(F_q).
OUTPUT: kP.
1. Use Algorithm 3.35 to compute NAF_w(k) = Σ_{i=0}^{l−1} k_i 2^i.
2. Compute P_i = iP for i ∈ {1, 3, 5, ..., 2^{w−1} − 1}.
3. Q ← ∞.
4. For i from l − 1 downto 0 do
   4.1 Q ← 2Q.
   4.2 If k_i ≠ 0 then:
       If k_i > 0 then Q ← Q + P_{k_i};
       Else Q ← Q − P_{−k_i}.
5. Return(Q).
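The window NAF method can be sketched on a toy curve as follows. Everything concrete here is an assumption for illustration: the curve y^2 = x^3 + 2x + 3 over F_97, affine arithmetic, and the helper names. With w = 2 the function reduces to the binary NAF method of Algorithm 3.31.

```python
P_MOD, A_COEF = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over F_97 (an assumption)
INF = None                # point at infinity

def add(P, Q):
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def neg(P):
    return INF if P is INF else (P[0], -P[1] % P_MOD)

def wnaf(k, w):
    """Width-w NAF digits of k, least significant first (Algorithm 3.35)."""
    digits = []
    while k >= 1:
        d = 0
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d
        digits.append(d)
        k >>= 1
    return digits

def window_naf_mul(k, P, w):
    """Sketch of Algorithm 3.36."""
    digits = wnaf(k, w)
    two_p = add(P, P)
    table = {1: P}                              # step 2: odd multiples iP
    for i in range(3, 1 << (w - 1), 2):
        table[i] = add(table[i - 2], two_p)
    Q = INF
    for d in reversed(digits):                  # step 4: left to right
        Q = add(Q, Q)
        if d > 0:
            Q = add(Q, table[d])
        elif d < 0:
            Q = add(Q, neg(table[-d]))
    return Q
```

Subtraction is implemented as addition of the negated table entry, so only the positive odd multiples need to be stored.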


Note 3.37 (selection of coordinates) The number of field inversions required can be 
reduced by use of projective coordinates for the accumulator Q. If inversion is suffi- 
ciently expensive relative to field multiplication, then projective coordinates may also 
be effective for P;. Chudnovsky coordinates (§3.2.2) for curves over prime fields elim- 
inate inversions in precomputation at the cost of less-efficient Jacobian-Chudnovsky 
mixed additions in the evaluation phase. 
The window NAF method employs a "sliding window" in the sense that Algorithm
3.35 has a width-w window, moving right-to-left, skipping consecutive zero entries
after a nonzero digit k_i is processed. As an alternative, a sliding window can be used
on the NAF of k, leading to Algorithm 3.38. The window (which has width at most w)
moves left-to-right over the digits in NAF(k), with placement so that the value in the
window is odd (to reduce the required precomputation).


Algorithm 3.38 Sliding window method for point multiplication

INPUT: Window width w, positive integer k, P ∈ E(F_q).
OUTPUT: kP.
1. Use Algorithm 3.30 to compute NAF(k) = Σ_{i=0}^{l−1} k_i 2^i.
2. Compute P_i = iP for i ∈ {1, 3, ..., 2(2^w − (−1)^w)/3 − 1}.
3. Q ← ∞, i ← l − 1.
4. While i ≥ 0 do
   4.1 If k_i = 0 then t ← 1, u ← 0;
   4.2 Else: find the largest t ≤ w such that u ← (k_i, ..., k_{i−t+1}) is odd.
   4.3 Q ← 2^t Q.
   4.4 If u > 0 then Q ← Q + P_u; else if u < 0 then Q ← Q − P_{−u}.
   4.5 i ← i − t.
5. Return(Q).
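A sketch of the sliding window method over NAF(k) is given below, again on an assumed toy curve (y^2 = x^3 + 2x + 3 over F_97) with illustrative helper names. The window value u is odd exactly when its last digit is nonzero, which is how the "largest t" of step 4.2 is located.

```python
P_MOD, A_COEF = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over F_97 (an assumption)
INF = None

def add(P, Q):
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def neg(P):
    return INF if P is INF else (P[0], -P[1] % P_MOD)

def naf(k):
    """NAF digits of k, least significant first (Algorithm 3.30)."""
    digits = []
    while k >= 1:
        d = 2 - (k % 4) if k & 1 else 0
        k = (k - d) // 2
        digits.append(d)
    return digits

def sliding_window_mul(k, P, w):
    """Sketch of Algorithm 3.38."""
    digits = naf(k)
    max_u = 2 * ((1 << w) - (-1) ** w) // 3 - 1
    two_p = add(P, P)
    table = {1: P}                               # step 2: odd multiples up to max_u
    for i in range(3, max_u + 1, 2):
        table[i] = add(table[i - 2], two_p)
    Q, i = INF, len(digits) - 1
    while i >= 0:
        if digits[i] == 0:
            t, u = 1, 0
        else:
            t = min(w, i + 1)
            while digits[i - t + 1] == 0:        # shrink until the window is odd
                t -= 1
            u = sum(digits[i - j] * (1 << (t - 1 - j)) for j in range(t))
        for _ in range(t):                       # Q <- 2^t Q
            Q = add(Q, Q)
        if u > 0:
            Q = add(Q, table[u])
        elif u < 0:
            Q = add(Q, neg(table[-u]))
        i -= t
    return Q
```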


The average length of a run of zeros between windows in the sliding window method
is


ν(w) = 4/3 − (−1)^w / (3 · 2^{w−2}).


It follows that the expected running time of Algorithm 3.38 is approximately


[1D + ((2^w − (−1)^w)/3 − 1)A] + [(m/(w + ν(w)))A + mD].    (3.22)


Note 3.39 (comparing sliding window and window NAF methods) For a given w, the
sliding window method allows larger values in a window compared with those appearing
in a width-w NAF. This translates to a higher cost for precomputation (roughly
2^w/3 in step 2 of Algorithm 3.38 versus 2^w/4 point operations in step 2 of
Algorithm 3.36) in the sliding window method, but fewer point operations in the main loop
(m/(w + ν(w)) versus m/(w + 1)). If the comparison is on point operations, then the
window NAF method will usually result in fewer point additions (when the optimum w
is selected for each method) for m of interest. To make a more precise comparison, the
coordinate representations (driven by the cost of field inversion versus multiplication)
must be considered.

As an example, consider the NIST binary curves and suppose that the inverse to
multiplication ratio is I/M = 8. Affine coordinates are used in precomputation, while the
main loop uses mixed projective-affine additions. Table 3.5 shows the expected cost
of point additions in each method. Note that there will also be m point doublings with
each method, so the difference in times for point multiplication will be even smaller
than Table 3.5 suggests. If there are constraints on the number of points that can be
stored at the precomputation phase, then the difference in precomputation may decide
the best method. For example, if only three points can be stored, then the sliding
window method will be preferred, while storage for four points will favour the window
NAF method. The differences are fairly small however; in the example, use of w = 3
(two and three points of precomputation, respectively) for both methods will favour
sliding window, but gives only 7-10% reduction in point addition cost over window
NAF.


points | w       | m = 163 | m = 233 | m = 283 | m = 409 | m = 571
       | WN  SW  | WN  SW  | WN  SW  | WN  SW  | WN  SW  | WN  SW

Table 3.5. Point addition cost in sliding versus window NAF methods, when I/M = 8. "points"
denotes the number of points stored in the precomputation stage. "WN" denotes the window
NAF method (Algorithm 3.36). "SW" denotes the sliding window method (Algorithm 3.38).


Montgomery’s method 


Algorithm 3.40 for non-supersingular elliptic curves y^2 + xy = x^3 + ax^2 + b over
binary fields is due to López and Dahab, and is based on an idea of Montgomery.


Let Q_1 = (x_1, y_1) and Q_2 = (x_2, y_2) with Q_1 ≠ ±Q_2. Let Q_1 + Q_2 = (x_3, y_3) and
Q_1 − Q_2 = (x_4, y_4). Then using the addition formulas (see §3.1.2), it can be verified
that


x_3 = x_4 + x_2/(x_1 + x_2) + (x_2/(x_1 + x_2))^2.    (3.23)


Thus, the x-coordinate of Q_1 + Q_2 can be computed from the x-coordinates of Q_1,
Q_2 and Q_1 − Q_2. Iteration j of Algorithm 3.40 for determining kP computes the
x-coordinates only of T_j = [lP, (l + 1)P], where l is the integer represented by the j
leftmost bits of k. Then T_{j+1} = [2lP, (2l + 1)P] or [(2l + 1)P, (2l + 2)P] if the
(j + 1)st leftmost bit of k is 0 or 1, respectively, as illustrated in Figure 3.4. Each
iteration requires one doubling and one addition using (3.23). After the last iteration,
having computed the x-coordinates of kP = (x_1, y_1) and (k + 1)P = (x_2, y_2), the
y-coordinate of kP can be recovered as:


y_1 = x^{−1}(x_1 + x)[(x_1 + x)(x_2 + x) + x^2 + y] + y.    (3.24)


(k_{t−1} k_{t−2} ··· k_{t−j} k_{t−(j+1)} k_{t−(j+2)} ··· k_1 k_0)_2 P

[lP, (l + 1)P]  →  [2lP, lP + (l + 1)P],           if k_{t−(j+1)} = 0
                   [lP + (l + 1)P, 2(l + 1)P],     if k_{t−(j+1)} = 1

Figure 3.4. One iteration in Montgomery point multiplication. After j iterations, the
x-coordinates of lP and (l + 1)P are known for l = (k_{t−1} ··· k_{t−j})_2. Iteration
j + 1 requires a doubling and an addition to find the x-coordinates of l′P and
(l′ + 1)P for l′ = (k_{t−1} ··· k_{t−(j+1)})_2.


Equation (3.24) is derived using the addition formula for computing the x-coordinate
x_2 of (k + 1)P from kP = (x_1, y_1) and P = (x, y).

Algorithm 3.40 is presented using standard projective coordinates (see §3.2.1); only
the X- and Z-coordinates of points are computed in steps 1 and 2. The approximate
running time is

6mM + (1I + 10M).    (3.25)


One advantage of Algorithm 3.40 is that it does not have any extra storage require- 
ments. Another advantage is that the same operations are performed in every iteration 
of the main loop, thereby potentially increasing resistance to timing attacks and power 
analysis attacks (cf. §5.3). 


Algorithm 3.40 Montgomery point multiplication (for elliptic curves over F_{2^m})
INPUT: k = (k_{t−1}, ..., k_1, k_0)_2 with k_{t−1} = 1, P = (x, y) ∈ E(F_{2^m}).
OUTPUT: kP.
1. X_1 ← x, Z_1 ← 1, X_2 ← x^4 + b, Z_2 ← x^2.    {Compute (P, 2P)}
2. For i from t − 2 downto 0 do
   2.1 If k_i = 1 then
       T ← Z_1, Z_1 ← (X_1 Z_2 + X_2 Z_1)^2, X_1 ← xZ_1 + X_1 X_2 T Z_2.
       T ← X_2, X_2 ← X_2^4 + bZ_2^4, Z_2 ← T^2 Z_2^2.
   2.2 Else
       T ← Z_2, Z_2 ← (X_1 Z_2 + X_2 Z_1)^2, X_2 ← xZ_2 + X_1 X_2 Z_1 T.
       T ← X_1, X_1 ← X_1^4 + bZ_1^4, Z_1 ← T^2 Z_1^2.
3. x_3 ← X_1/Z_1.
4. y_3 ← (x + X_1/Z_1)[(X_1 + xZ_1)(X_2 + xZ_2) + (x^2 + y)(Z_1 Z_2)](xZ_1 Z_2)^{−1} + y.
5. Return(x_3, y_3).
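The x-only ladder and the y-recovery of step 4 can be tried on a very small binary field. The following sketch is illustrative only: the field F_16 with reduction polynomial z^4 + z + 1, the bit-vector encoding, and all function names are assumptions, and `affine_add` is a plain reference implementation used to cross-check the ladder. The sketch assumes that P has nonzero x-coordinate and that neither kP nor (k + 1)P is the point at infinity.

```python
M, POLY = 4, 0b10011   # toy field F_16 = F_2[z]/(z^4 + z + 1) (an assumption)

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def sq(a):
    return gf_mul(a, a)

def gf_inv(a):
    r = 1
    for _ in range((1 << M) - 2):   # a^(2^M - 2)
        r = gf_mul(r, a)
    return r

def montgomery_mul(k, P, b):
    """Sketch of Algorithm 3.40 for k >= 1."""
    x, y = P
    X1, Z1 = x, 1
    X2, Z2 = sq(sq(x)) ^ b, sq(x)                      # (P, 2P)
    for bit in bin(k)[3:]:                             # bits k_{t-2} ... k_0
        S = sq(gf_mul(X1, Z2) ^ gf_mul(X2, Z1))        # Z of the sum
        Xs = gf_mul(x, S) ^ gf_mul(gf_mul(X1, X2), gf_mul(Z1, Z2))
        if bit == "1":                                 # double (X2 : Z2)
            Xd, Zd = sq(sq(X2)) ^ gf_mul(b, sq(sq(Z2))), sq(gf_mul(X2, Z2))
            X1, Z1, X2, Z2 = Xs, S, Xd, Zd
        else:                                          # double (X1 : Z1)
            Xd, Zd = sq(sq(X1)) ^ gf_mul(b, sq(sq(Z1))), sq(gf_mul(X1, Z1))
            X1, Z1, X2, Z2 = Xd, Zd, Xs, S
    x3 = gf_mul(X1, gf_inv(Z1))
    z12 = gf_mul(Z1, Z2)
    t = gf_mul(X1 ^ gf_mul(x, Z1), X2 ^ gf_mul(x, Z2)) ^ gf_mul(sq(x) ^ y, z12)
    y3 = gf_mul(gf_mul(x ^ x3, t), gf_inv(gf_mul(x, z12))) ^ y
    return (x3, y3)

def affine_add(P, Q, a):
    """Reference affine addition on y^2 + xy = x^3 + ax^2 + b; None is infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y2 == x1 ^ y1:
        return None
    if P == Q:
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = sq(lam) ^ lam ^ a
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
        x3 = sq(lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)
```

Every loop iteration performs one x-only addition and one x-only doubling regardless of the bit value, which mirrors the uniform structure described for Algorithm 3.40.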







3.3.2 Fixed point 


If the point P is fixed and some storage is available, then the point multiplication
operation kP can be accelerated by precomputing some data that depends only on P.
For example, if the points 2P, 2^2 P, ..., 2^{t−1}P are precomputed, then the right-to-left
binary method (Algorithm 3.26) has expected running time (m/2)A (all doublings are
eliminated).¹


Fixed-base windowing methods


Brickell, Gordon, McCurley and Wilson proposed the following refinement to the
simple method of precomputing every multiple 2^i P. Let (K_{d−1}, ..., K_1, K_0)_{2^w} be the
base-2^w representation of k, where d = ⌈t/w⌉, and let Q_j = Σ_{i : K_i = j} 2^{wi} P for
each j, 1 ≤ j ≤ 2^w − 1. Then

kP = Σ_{i=0}^{d−1} K_i (2^{wi} P) = Σ_{j=1}^{2^w−1} ( j Σ_{i : K_i = j} 2^{wi} P ) = Σ_{j=1}^{2^w−1} j Q_j
   = Q_{2^w−1} + (Q_{2^w−1} + Q_{2^w−2}) + ··· + (Q_{2^w−1} + Q_{2^w−2} + ··· + Q_1).

Algorithm 3.41 is based on this observation. Its expected running time is approximately

(2^w + d − 3)A    (3.26)


where d = ⌈t/w⌉ and t ≈ m.


Algorithm 3.41 Fixed-base windowing method for point multiplication
INPUT: Window width w, d = ⌈t/w⌉, k = (K_{d−1}, ..., K_1, K_0)_{2^w}, P ∈ E(F_q).
OUTPUT: kP.
1. Precomputation. Compute P_i = 2^{wi} P, 0 ≤ i ≤ d − 1.
2. A ← ∞, B ← ∞.
3. For j from 2^w − 1 downto 1 do
   3.1 For each i for which K_i = j do: B ← B + P_i.    {Add Q_j to B}
   3.2 A ← A + B.
4. Return(A).
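The accumulation in steps 2 to 4 can be sketched as follows. The toy curve y^2 = x^3 + 2x + 3 over F_97, the affine arithmetic and the function names are assumptions made for the example; in practice the table would be computed once for the fixed point and reused.

```python
P_MOD, A_COEF = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over F_97 (an assumption)
INF = None

def add(P, Q):
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def fixed_base_precompute(P, w, d):
    """Step 1: table[i] = 2^(w*i) P for 0 <= i <= d - 1."""
    table = [P]
    for _ in range(d - 1):
        T = table[-1]
        for _ in range(w):
            T = add(T, T)
        table.append(T)
    return table

def fixed_base_window_mul(k, table, w):
    """Steps 2-4 of Algorithm 3.41; k must fit in len(table) base-2^w digits."""
    d = len(table)
    K = [(k >> (w * i)) & ((1 << w) - 1) for i in range(d)]
    A = B = INF
    for j in range((1 << w) - 1, 0, -1):
        for i in range(d):
            if K[i] == j:
                B = add(B, table[i])    # B accumulates Q_{2^w-1} + ... + Q_j
        A = add(A, B)                   # after the loop, A = sum of j * Q_j
    return A
```

No doublings occur in the evaluation phase; all of them are absorbed into the precomputed table.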


Algorithm 3.42 modifies Algorithm 3.41 by using NAF(k) instead of the binary
representation of k. In Algorithm 3.42, NAF(k) is divided into {0, ±1}-strings K_i each
of the same length w:

NAF(k) = K_{d−1} ‖ ··· ‖ K_1 ‖ K_0.

Since each K_i is in non-adjacent form, it represents an integer in the interval [−I, I]
where I = (2^{w+1} − 2)/3 if w is even, and I = (2^{w+1} − 1)/3 if w is odd. The expected
running time of Algorithm 3.42 is approximately


(2^{w+1}/3 + d − 2)A    (3.27)


where d = ⌈(t + 1)/w⌉.


¹Recall the following notation: t is the bitlength of k, and m = ⌈log_2 q⌉. Also, we assume that t ≈ m.
Algorithm 3.42 Fixed-base NAF windowing method for point multiplication

INPUT: Window width w, positive integer k, P ∈ E(F_q).
OUTPUT: kP.
1. Precomputation. Compute P_i = 2^{wi} P, 0 ≤ i ≤ ⌈(t + 1)/w⌉.
2. Use Algorithm 3.30 to compute NAF(k) = Σ_{i=0}^{l−1} k_i 2^i.
3. d ← ⌈l/w⌉.
4. By padding NAF(k) on the left with 0s if necessary, write (k_{l−1}, ..., k_1, k_0) =
   K_{d−1} ‖ ··· ‖ K_1 ‖ K_0 where each K_i is a {0, ±1}-string of length w.
5. If w is even then I ← (2^{w+1} − 2)/3; else I ← (2^{w+1} − 1)/3.
6. A ← ∞, B ← ∞.
7. For j from I downto 1 do
   7.1 For each i for which K_i = j do: B ← B + P_i.    {Add Q_j to B}
   7.2 For each i for which K_i = −j do: B ← B − P_i.   {Add −Q_j to B}
   7.3 A ← A + B.
8. Return(A).


Note 3.43 (selection of coordinates) If field inversion is sufficiently expensive, then 
projective coordinates will be preferred for one or both of the accumulators A and 
B in Algorithms 3.41 and 3.42. In the case of curves over prime fields, Table 3.3 shows 
that Chudnovsky coordinates for B and Jacobian coordinates for A is the preferred 
selection if projective coordinates are used, in which case Algorithm 3.42 has mixed 
Chudnovsky-affine additions at steps 7.1 and 7.2, and mixed Jacobian-Chudnovsky 
addition at step 7.3. 


Fixed-base comb methods 


Let d = ⌈t/w⌉. In the fixed-base comb method (Algorithm 3.44), the binary
representation of k is first padded on the left with dw − t 0s, and is then divided into w bit
strings each of the same length d so that


k = K^{w−1} ‖ ··· ‖ K^1 ‖ K^0.


The bit strings K^j are written as rows of an exponent array


[ K^0     ]   [ K^0_{d−1}      ···  K^0_0     ]   [ k_{d−1}         ···  k_0        ]
[   ⋮     ]   [     ⋮                  ⋮      ]   [    ⋮                   ⋮        ]
[ K^{w′}  ] = [ K^{w′}_{d−1}   ···  K^{w′}_0  ] = [ k_{(w′+1)d−1}   ···  k_{w′d}    ]
[   ⋮     ]   [     ⋮                  ⋮      ]   [    ⋮                   ⋮        ]
[ K^{w−1} ]   [ K^{w−1}_{d−1}  ···  K^{w−1}_0 ]   [ k_{wd−1}        ···  k_{(w−1)d} ]


whose columns are then processed one at a time. In order to accelerate the computation,
the points


[a_{w−1}, ..., a_2, a_1, a_0]P = a_{w−1} 2^{(w−1)d} P + ··· + a_2 2^{2d} P + a_1 2^d P + a_0 P


are precomputed for all possible bit strings (a_{w−1}, ..., a_1, a_0).


Algorithm 3.44 Fixed-base comb method for point multiplication
INPUT: Window width w, d = ⌈t/w⌉, k = (k_{t−1}, ..., k_1, k_0)_2, P ∈ E(F_q).
OUTPUT: kP.
1. Precomputation. Compute [a_{w−1}, ..., a_1, a_0]P for all bit strings
   (a_{w−1}, ..., a_1, a_0) of length w.
2. By padding k on the left with 0s if necessary, write k = K^{w−1} ‖ ··· ‖ K^1 ‖ K^0,
   where each K^j is a bit string of length d. Let K^j_i denote the ith bit of K^j.
3. Q ← ∞.
4. For i from d − 1 downto 0 do
   4.1 Q ← 2Q.
   4.2 Q ← Q + [K^{w−1}_i, ..., K^1_i, K^0_i]P.
5. Return(Q).
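The comb method can be sketched as below. The toy curve y^2 = x^3 + 2x + 3 over F_97 and the function names are assumptions for illustration. The table index packs the column bits so that bit j of the index is K^j_i, that is, bit k_{jd+i} of k.

```python
P_MOD, A_COEF = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over F_97 (an assumption)
INF = None

def add(P, Q):
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def comb_precompute(P, w, d):
    """table[a] = [a_{w-1}, ..., a_0]P = sum over j of a_j * 2^(j*d) P."""
    base = [P]                                  # base[j] = 2^(j*d) P
    for _ in range(w - 1):
        T = base[-1]
        for _ in range(d):
            T = add(T, T)
        base.append(T)
    table = [INF] * (1 << w)
    for a in range(1, 1 << w):
        low = a & -a                            # lowest set bit of a
        table[a] = add(table[a ^ low], base[low.bit_length() - 1])
    return table

def comb_mul(k, table, w, d):
    """Steps 3-5 of Algorithm 3.44; k must have at most w*d bits."""
    Q = INF
    for i in range(d - 1, -1, -1):
        Q = add(Q, Q)
        col = 0
        for j in range(w):                      # column i of the exponent array
            col |= ((k >> (j * d + i)) & 1) << j
        Q = add(Q, table[col])
    return Q
```

The main loop runs d times instead of t times, at the price of a table with 2^w − 1 nontrivial entries.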


The expected running time of Algorithm 3.44 is


(((2^w − 1)/2^w)d − 1)A + (d − 1)D.    (3.28)

For w > 2, Algorithm 3.44 has approximately the same number of point additions
as point doubles in the main loop. Figure 3.5 illustrates the use of a second table of
precomputation in Algorithm 3.45, leading to roughly half as many point doubles as
point additions.


Algorithm 3.45 Fixed-base comb method (with two tables) for point multiplication
INPUT: Window width w, d = ⌈t/w⌉, e = ⌈d/2⌉, k = (k_{t−1}, ..., k_0)_2, P ∈ E(F_q).
OUTPUT: kP.
1. Precomputation. Compute [a_{w−1}, ..., a_1, a_0]P and 2^e[a_{w−1}, ..., a_1, a_0]P for all
   bit strings (a_{w−1}, ..., a_1, a_0) of length w.
2. By padding k on the left with 0s if necessary, write k = K^{w−1} ‖ ··· ‖ K^1 ‖ K^0,
   where each K^j is a bit string of length d. Let K^j_i denote the ith bit of K^j.
3. Q ← ∞.
4. For i from e − 1 downto 0 do
   4.1 Q ← 2Q.
   4.2 Q ← Q + [K^{w−1}_i, ..., K^1_i, K^0_i]P + 2^e[K^{w−1}_{i+e}, ..., K^1_{i+e}, K^0_{i+e}]P.
5. Return(Q).
w×d exponent array:

K^0_{d−1}      ···  K^0_0
    ⋮                  ⋮
K^{w−1}_{d−1}  ···  K^{w−1}_0

Precomputation (2^w − 1 elements in each table): [a_{w−1}, ..., a_0]P and 2^e[a_{w−1}, ..., a_0]P

Q ← 2Q + 2^e[K^{w−1}_{i+e}, ..., K^0_{i+e}]P + [K^{w−1}_i, ..., K^0_i]P


Figure 3.5. One iteration in Algorithm 3.45. The w×d exponent array is processed left-to-right in
e = ⌈d/2⌉ iterations to find kP. Precomputation finds [a_{w−1}, ..., a_0]P and 2^e[a_{w−1}, ..., a_0]P
for all w-bit values (a_{w−1}, ..., a_0), where [a_{w−1}, ..., a_0] = a_{w−1} 2^{(w−1)d} + ··· + a_1 2^d + a_0.


The expected running time of Algorithm 3.45 is approximately


(((2^w − 1)/2^w)2e − 1)A + (e − 1)D.    (3.29)

For a fixed w, Algorithm 3.45 requires twice as much storage for precomputation as
Algorithm 3.44. For a given amount of precomputation, Algorithm 3.45 is expected to
outperform Algorithm 3.44 whenever


2^{w−1}(w − 2) / (2^w − w − 1) ≥ A/D,

where w is the window width used in Algorithm 3.44 (and hence width w − 1 is used
with Algorithm 3.45). As an example, LD coordinates in the binary field case give
A/D ≈ 2, requiring (roughly) w ≥ 6 in Algorithm 3.44 in order for the two-table
method to be superior. For the NIST curves over prime fields, A/D ≈ 1.4 with Jacobian
coordinates and S = 0.8M, requiring w ≥ 4.
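The two thresholds quoted above can be checked by evaluating the left-hand side of the comparison directly. The helper names below are hypothetical, and the starting width of 3 is an assumption (the two-table method with width w − 1 needs w ≥ 3 to be meaningful).

```python
from fractions import Fraction

def two_table_lhs(w):
    """Left-hand side 2^(w-1) (w - 2) / (2^w - w - 1) of the comparison."""
    return Fraction(2 ** (w - 1) * (w - 2), 2 ** w - w - 1)

def smallest_width(ratio):
    """Smallest w >= 3 for which the left-hand side reaches the ratio A/D."""
    w = 3
    while two_table_lhs(w) < ratio:
        w += 1
    return w

print(smallest_width(2))                  # 6   (A/D about 2, LD coordinates)
print(smallest_width(Fraction(14, 10)))   # 4   (A/D about 1.4, Jacobian coordinates)
```

The values of the left-hand side for w = 3, 4, 5, 6 are 1, 16/11, 24/13 and 128/57, which is where the bounds w ≥ 6 and w ≥ 4 come from.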


Note 3.46 (Algorithm 3.45 with simultaneous addition) If storage for an additional e
points (which depend on k) can be tolerated, then the values


T_i ← [K^{w−1}_i, ..., K^1_i, K^0_i]P + 2^e[K^{w−1}_{i+e}, ..., K^1_{i+e}, K^0_{i+e}]P,  0 ≤ i < e,


at step 4.2 of Algorithm 3.45 can be determined in a (data-dependent) precomputation
phase. The strategy calculates the points T_i in affine coordinates, using the method of
simultaneous inversion (Algorithm 2.26) to replace an expected e′ = (1 − 1/2^w)^2 e field
inverses with one inverse and 3(e′ − 1) field multiplications.

If Q is maintained in projective coordinates, then e′ mixed-coordinate additions at
step 4.2 are replaced by e′ simultaneous additions in the new precomputation phase.
With the coordinates discussed in §3.2, this translates into the following approximate
field operation counts.


       e′ additions in E(F_{2^m})           |     e′ additions in E(F_{p^m}), p > 3
mixed-coordinate   simultaneous             | mixed-coordinate   simultaneous
8e′M               I + (5e′ − 3)M           | 8e′M + 3e′S        I + (5e′ − 3)M + e′S


For curves of practical interest from §3.2 over fields where I/M is expected to be small
(e.g., binary fields and OEFs), a roughly 15% reduction in point multiplication time is
predicted.





Note 3.47 (comb methods) Algorithms 3.44 and 3.45 are special cases of
exponentiation methods due to Lim and Lee. For given parameters w and v, a t-bit integer
k is written as an exponent array of w×d bits where d = ⌈t/w⌉, as illustrated
in Figure 3.6. A typical entry K^{w′}_{v′} consists of the e = ⌈d/v⌉ bits of k given by
K^{w′}_{v′} = (k_{l+e−1}, ..., k_{l+1}, k_l) where l = dw′ + ev′ (with zeros replacing some
entries if v′ = v − 1 and v ∤ d).


K^0_{v−1}      ···  (K^0_{v′,e−1}, ..., K^0_{v′,0})            ···  K^0_0
    ⋮                            ⋮                                    ⋮
K^{w′}_{v−1}   ···  (K^{w′}_{v′,e−1}, ..., K^{w′}_{v′,0})      ···  K^{w′}_0
    ⋮                            ⋮                                    ⋮
K^{w−1}_{v−1}  ···  (K^{w−1}_{v′,e−1}, ..., K^{w−1}_{v′,0})    ···  K^{w−1}_0

(each entry has e = ⌈d/v⌉ bits; each row has d = ⌈t/w⌉ bits)

Figure 3.6. The exponent array in Lim-Lee combing methods. Given parameters w and v, a t-bit
integer k is written as a w×d bit array for d = ⌈t/w⌉. Entries K^{w′}_{v′} have e = ⌈d/v⌉ bits.


If K^{w′} denotes the integer formed from the bits in row w′, then

kP = Σ_{w′=0}^{w−1} K^{w′} 2^{dw′} P
   = Σ_{w′=0}^{w−1} ( Σ_{v′=0}^{v−1} K^{w′}_{v′} 2^{ev′} ) 2^{dw′} P
   = Σ_{w′=0}^{w−1} Σ_{v′=0}^{v−1} ( Σ_{e′=0}^{e−1} K^{w′}_{v′,e′} 2^{e′} ) 2^{ev′} 2^{dw′} P
   = Σ_{e′=0}^{e−1} 2^{e′} Σ_{v′=0}^{v−1} ( Σ_{w′=0}^{w−1} K^{w′}_{v′,e′} 2^{ev′} 2^{dw′} P ),

where each inner parenthesized sum in the last line is the point P[v′][K_{v′,e′}], and
where v(2^w − 1) points P[v′][u] for v′ ∈ [0, v − 1] and u ∈ [1, 2^w − 1] are precomputed.
A point multiplication algorithm based on this method is expected to require
approximately e − 1 ≈ t/(wv) − 1 point doublings and (t/w − 1)(2^w − 1)/2^w point
additions. Algorithms 3.44 and 3.45 are the cases v = 1 and v = 2, respectively.


3.3.3 Multiple point multiplication 


One method to potentially speed the computation of kP + lQ is simultaneous multiple
point multiplication (Algorithm 3.48), also known as Shamir's trick. If k and l are t-bit
numbers, then their binary representations are written in a 2×t matrix known as the
exponent array. Given width w, the values iP + jQ are calculated for 0 ≤ i, j < 2^w.
At each of ⌈t/w⌉ steps, the accumulator receives w doublings and an addition from the
table of values iP + jQ determined by the contents of a 2×w window passed over the
exponent array; see Figure 3.7.


kP = [ K^{d−1} | ··· | K^i | ··· | K^0 ] P        (each block is w bits)
lQ = [ L^{d−1} | ··· | L^i | ··· | L^0 ] Q

Precomputation (2^{2w} − 1 points), used as a lookup table:
0P + 1Q, ..., (2^w − 1)P + (2^w − 1)Q

R ← 2^w R + (K^i P + L^i Q)

Figure 3.7. Simultaneous point multiplication accumulation step.


Algorithm 3.48 has an expected running time of approximately


[(3 · 2^{2(w−1)} − 2^{w−1} − 1)A + (2^{2(w−1)} − 2^{w−1})D] + [(((2^{2w} − 1)/2^{2w})d − 1)A + (d − 1)wD]    (3.30)


and requires storage for 2^{2w} − 1 points.


Algorithm 3.48 Simultaneous multiple point multiplication

INPUT: Window width w, k = (k_{t−1}, ..., k_0)_2, l = (l_{t−1}, ..., l_0)_2, P, Q ∈ E(F_q).
OUTPUT: kP + lQ.
1. Compute iP + jQ for all i, j ∈ [0, 2^w − 1].
2. Write k = (K^{d−1}, ..., K^1, K^0) and l = (L^{d−1}, ..., L^1, L^0) where each K^i, L^i is
   a bitstring of length w, and d = ⌈t/w⌉.
3. R ← ∞.
4. For i from d − 1 downto 0 do
   4.1 R ← 2^w R.
   4.2 R ← R + (K^i P + L^i Q).
5. Return(R).
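A compact sketch of Algorithm 3.48 is given below. The toy curve y^2 = x^3 + 2x + 3 over F_97, the sample points and the function names are assumptions for illustration; the table is built naively by repeated addition rather than with the operation count of (3.30) in mind.

```python
P_MOD, A_COEF = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over F_97 (an assumption)
INF = None

def add(P, Q):
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def shamir_mul(k, l, P, Q, w, t):
    """Sketch of Algorithm 3.48 for t-bit scalars k and l."""
    size = 1 << w
    table = [[INF] * size for _ in range(size)]   # table[i][j] = iP + jQ
    for i in range(size):
        if i > 0:
            table[i][0] = add(table[i - 1][0], P)
        for j in range(1, size):
            table[i][j] = add(table[i][j - 1], Q)
    d = -(-t // w)                                # ceil(t / w)
    mask = size - 1
    R = INF
    for i in range(d - 1, -1, -1):
        for _ in range(w):                        # R <- 2^w R
            R = add(R, R)
        R = add(R, table[(k >> (w * i)) & mask][(l >> (w * i)) & mask])
    return R
```

On this toy curve, (3, 6) and (0, 10) are two points that can serve as P and Q.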


Algorithm 3.48 can be improved by use of a sliding window. At each step, placement
of a window of width at most w is such that the right-most column is nonzero.
Precomputation storage is reduced by 2^{2(w−1)} − 1 points. The improved algorithm is
expected to have t/(w + (1/3)) point additions in the evaluation stage, a savings of
approximately 9% (in evaluation stage additions) compared with Algorithm 3.48 for
w ∈ {2, 3}.


Joint sparse form 


If k and l are each written in NAF form, then the expected number of zero columns in
the exponent array increases, so that the expected number of additions in the evaluation
stage of a suitably modified Algorithm 3.48 (processing one column at a time) is 5t/9.
The expected number of zero columns can be increased by choosing signed binary
expansions of k and l jointly. The joint sparse form (JSF) exponent array of positive
integers k and l is characterized by the following properties.


1. At least one of any three consecutive columns is zero.
2. Consecutive terms in a row do not have opposite signs.
3. If k_{j+1} k_j ≠ 0 then l_{j+1} ≠ 0 and l_j = 0. If l_{j+1} l_j ≠ 0 then k_{j+1} ≠ 0 and k_j = 0.


The representation has minimal weight among all joint signed binary expansions,
where the weight is defined to be the number of nonzero columns.


Example 3.49 (joint sparse form) The following table gives exponent arrays for k =
53 and l = 102.


          binary            NAF                       joint sparse form
k = 53    0 1 1 0 1 0 1     0  1  0 −1  0  1  0  1    1  0  0 −1  0 −1 −1
l = 102   1 1 0 0 1 1 0     1  0 −1  0  1  0 −1  0    1  1  0  1  0 −1  0
weight    6                 8                         5


If Algorithm 3.48 is modified to use JSF, processing a single column in each
iteration, then t/2 additions (rather than 5t/9 using NAFs) are required in the evaluation
stage. Algorithm 3.50 finds the joint sparse form for integers k^1 and k^2. Although it
is written in terms of integer operations, in fact only simple bit arithmetic is required;
for example, evaluation modulo 8 means that three bits must be examined, and ⌊k^i/2⌋
discards the rightmost bit.
Algorithm 3.50 Joint sparse form

INPUT: Nonnegative integers k^1 and k^2, not both zero.
OUTPUT: JSF(k^2, k^1), the joint sparse form of k^1 and k^2.
1. l ← 0, d_1 ← 0, d_2 ← 0.
2. While (k^1 + d_1 > 0 or k^2 + d_2 > 0) do
   2.1 ℓ_1 ← d_1 + k^1, ℓ_2 ← d_2 + k^2.
   2.2 For i from 1 to 2 do
       If ℓ_i is even then u ← 0;
       Else
          u ← ℓ_i mods 4.
          If ℓ_i ≡ ±3 (mod 8) and ℓ_{3−i} ≡ 2 (mod 4) then u ← −u.
       k^i_l ← u.
   2.3 For i from 1 to 2 do
       If 2d_i = 1 + k^i_l then d_i ← 1 − d_i.
       k^i ← ⌊k^i/2⌋.
   2.4 l ← l + 1.
3. Return JSF(k^2, k^1) = ( k^1_{l−1}, ..., k^1_0 )
                          ( k^2_{l−1}, ..., k^2_0 ).
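Algorithm 3.50 can be sketched with ordinary integer arithmetic as follows. The function name `jsf`, the zero-based indexing of the two scalars, and the least-significant-first digit order are assumptions made for the illustration.

```python
def jsf(k1, k2):
    """Return (digits of k1, digits of k2) in joint sparse form, LSB first."""
    k = [k1, k2]
    d = [0, 0]
    out = ([], [])
    while k[0] + d[0] > 0 or k[1] + d[1] > 0:
        ell = [d[0] + k[0], d[1] + k[1]]
        u = [0, 0]
        for i in (0, 1):
            if ell[i] % 2 == 1:
                u[i] = 1 if ell[i] % 4 == 1 else -1     # ell mods 4
                if ell[i] % 8 in (3, 5) and ell[1 - i] % 4 == 2:
                    u[i] = -u[i]
        for i in (0, 1):
            out[i].append(u[i])
            if 2 * d[i] == 1 + u[i]:
                d[i] = 1 - d[i]
            k[i] //= 2
    return out

row_k, row_l = jsf(53, 102)
print(row_k[::-1])   # [1, 0, 0, -1, 0, -1, -1]
print(row_l[::-1])   # [1, 1, 0, 1, 0, -1, 0]
```

The two printed rows are the joint sparse form of 53 and 102, most significant digit first; five of the seven columns are nonzero.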


Interleaving 


The simultaneous and comb methods process multiple point multiplications using 
precomputation involving combinations of the points. Roughly speaking, if each 
precomputed value involves only a single point, then the associated method is known as 
interleaving. 

In the calculation of Σ k^j P_j for points P_j and integers k^j, interleaving allows 
different methods to be used for each k^j P_j, provided that the doubling step can be done 
jointly. For example, width-w NAF methods with different widths can be used, or some 
point multiplications may be done by comb methods. However, the cost of the doubling 
is determined by the maximum number of doublings required in the methods for k^j P_j, 
and hence the benefits of a comb method may be lost in interleaving. 

Algorithm 3.51 is an interleaving method for computing Σ_{j=1}^{v} k^j P_j, where a 
width-w_j NAF is used on k^j. Points iP_j for odd i < 2^{w_j-1} are calculated in a 
precomputation phase. The expansions NAF_{w_j}(k^j) are processed jointly, left to right, 
with a single doubling of the accumulator at each stage; Figure 3.8 illustrates the case 
v = 2. The algorithm has an expected running time of approximately 


   [ |{j : w_j > 2}| D + Σ_{j=1}^{v} (2^{w_j-2} - 1) A ] 
      + [ max_{1≤j≤v} l_j D + Σ_{j=1}^{v} (l_j/(w_j + 1)) A ]          (3.31) 

where l_j denotes the length of NAF_{w_j}(k^j), and requires storage for Σ_{j=1}^{v} 2^{w_j-2} points. 

[Figure 3.8: the digits k^1_i and k^2_i of NAF_{w_j}(k^j) are looked up in precomputed 
tables of odd multiples 1P_j, 3P_j, ..., (2^{w_j-1} - 1)P_j (2^{w_1-2} + 2^{w_2-2} points 
in total) and accumulated as Q ← 2Q + k^1_i P_1 + k^2_i P_2.] 

Figure 3.8. Computing k^1 P_1 + k^2 P_2 using interleaving with NAFs. The point multiplication 
accumulation step is shown for the case v = 2 points. Scalar k^j is written in width-w_j NAF 
form. 


Algorithm 3.51 Interleaving with NAFs 
INPUT: v, integers k^j, widths w_j and points P_j, 1 ≤ j ≤ v. 
OUTPUT: Σ_{j=1}^{v} k^j P_j. 
1. Compute iP_j for i ∈ {1, 3, ..., 2^{w_j-1} - 1}, 1 ≤ j ≤ v. 
2. Use Algorithm 3.30 to compute NAF_{w_j}(k^j) = Σ_{i=0}^{l_j-1} k^j_i 2^i, 1 ≤ j ≤ v. 
3. Let l = max{l_j : 1 ≤ j ≤ v}. 
4. Define k^j_i = 0 for l_j ≤ i < l, 1 ≤ j ≤ v. 
5. Q ← ∞. 
6. For i from l - 1 downto 0 do 
   6.1 Q ← 2Q. 
   6.2 For j from 1 to v do 
          If k^j_i ≠ 0 then 
             If k^j_i > 0 then Q ← Q + k^j_i P_j; 
             Else Q ← Q - |k^j_i| P_j. 
7. Return(Q). 
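The control flow of Algorithm 3.51 can be tried out without any curve arithmetic. The Python sketch below is an assumption-laden illustration: "points" are modelled as plain integers (so that kP is ordinary multiplication and the result is easy to check), and the helper names `naf_w` and `interleave` are ours. In a real implementation the three group operations (double, add, subtract) would be elliptic curve operations and the tables would be built with point additions.

```python
def mods(x, m):
    """Signed residue of x modulo m, in the range [-m/2, m/2)."""
    r = x % m
    return r - m if r >= m // 2 else r


def naf_w(k, w):
    """Width-w NAF digits of a nonnegative integer k, least significant first."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            u = mods(k, 2 ** w)     # odd digit with |u| < 2^(w-1)
            k -= u
        else:
            u = 0
        digits.append(u)
        k //= 2
    return digits


def interleave(ks, ws, points):
    """Compute sum(k_j * P_j) following the steps of Algorithm 3.51.

    Integers stand in for points here; this is a model, not curve code.
    """
    # Step 1: tables of odd multiples i*P_j for i in {1, 3, ..., 2^(w_j-1) - 1}.
    tables = [{i: i * P for i in range(1, 2 ** (w - 1), 2)}
              for w, P in zip(ws, points)]
    # Steps 2-4: expansions, padded with zeros to a common length l.
    nafs = [naf_w(k, w) for k, w in zip(ks, ws)]
    l = max(len(d) for d in nafs)
    nafs = [d + [0] * (l - len(d)) for d in nafs]
    Q = 0                                   # identity element of the model group
    for i in range(l - 1, -1, -1):
        Q = 2 * Q                           # one shared doubling per column
        for j, digits in enumerate(nafs):
            d = digits[i]
            if d > 0:
                Q = Q + tables[j][d]
            elif d < 0:
                Q = Q - tables[j][-d]
    return Q


print(interleave([53, 102], [3, 4], [7, 11]) == 53 * 7 + 102 * 11)
```

Note that the doubling in step 6.1 is shared by all rows, which is why only the additions depend on the individual widths.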




Note 3.52 (comparison with simultaneous methods) Consider the calculation of kP + 
lQ, where k and l are approximately the same bitlength t. The simultaneous sliding and 
interleaving methods require essentially the same number of point doublings regardless 
of the window widths. For a given w, simultaneous sliding requires 3·2^{2(w-1)} points 
of storage, and approximately t/(w + (1/3)) point additions in the evaluation stage, 
while interleaving with width 2w + 1 on k and width 2w on l requires the same amount 
of storage, but only t/(2w + 2) + t/(2w + 1) = (4w + 3)t/(4w^2 + 6w + 2) < t/(w + (1/2)) 
additions in evaluation. Interleaving may also be preferable at the precomputation phase, 
since operations involving a known point P may be done in advance (encouraging the use 
of a wider width for NAF_w(k)), in contrast to the joint computations required in the 
simultaneous method. Table 3.6 compares operation counts for computing kP + lQ in the 
case that P (but not Q) is known in advance. 

In the case that storage for precomputation is limited to four points (including P 
and Q), interleaving with width-3 NAFs or use of the JSF give essentially the same 
performance, with interleaving requiring one or two more point doublings at the 
precomputation stage. Table 3.6 gives some comparative results for small window sizes. 








method                           w      storage   additions                  doubles 
Alg 3.48                         1      3         1 + 3t/4 ≈ 1 + .75t        t 
Alg 3.48                         2      15        9 + 15t/32 ≈ 9 + .47t      2 + t 
Alg 3.48 with sliding            2      12        9 + 3t/7 ≈ 9 + .43t        2 + t 
Alg 3.48 with NAF                       4         2 + 5t/9 ≈ 2 + .56t        t 
Alg 3.48 with JSF                       4         2 + t/2 ≈ 2 + .5t          t 
interleave with 3-NAF            3, 3   2 + 2     1 + t/2 ≈ 1 + .5t          1 + t 
interleave with 5-NAF & 4-NAF    5, 4   8 + 4     3 + 11t/30 ≈ 3 + .37t      1 + t 

Table 3.6. Approximate operation counts for computing kP + lQ, where k and l are t-bit integers. 
The precomputation involving only P is excluded. 


Interleaving can be considered as an alternative to the comb method (Algorithm 3.44) 
for computing kP. In this case, the exponent array for k is processed using 
interleaving (Algorithm 3.51), with k^j given by k = Σ_{j=1}^{w} k^j 2^{(j-1)d} and points P_j 
given by P_j = 2^{(j-1)d} P, 1 ≤ j ≤ w, where d is defined in Algorithm 3.44. Table 3.7 
compares the comb and interleaving methods for fixed storage. 
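The interleaving entries of Table 3.7 follow from the estimates behind (3.31). As a rough check, the Python sketch below assumes that the t-bit scalar is split into equally long rows, that a width-w NAF has density 1/(w + 1), and that a row of width w needs 2^{w-2} stored points; the function name `interleave_cost` is ours.

```python
from fractions import Fraction


def interleave_cost(widths):
    """Storage and per-bit operation counts for interleaving a fixed-point kP.

    `widths` lists the NAF width used on each row.  Returns
    (storage in points, additions as a multiple of t, doublings as a multiple of t).
    """
    rows = len(widths)
    storage = sum(2 ** (w - 2) for w in widths)
    # each row has about t/rows digits, of which a fraction 1/(w+1) is nonzero
    additions = sum(Fraction(1, rows * (w + 1)) for w in widths)
    doubles = Fraction(1, rows)
    return storage, additions, doubles


print(interleave_cost((3, 3)))            # storage 4, t/4 additions, t/2 doublings
print(interleave_cost((4, 4, 4, 3, 3)))   # storage 16, 11t/50 additions, t/5 doublings
```

The same function can be used to explore other width combinations for a given storage budget.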


method                          rows   storage   additions         doubles 
comb                            2      3         3t/8 ≈ .38t       t/2 
interleave (3, 3)               2      4         t/4 ≈ .25t        t/2 
comb                            4      15        15t/64 ≈ .23t     t/4 
comb (two-table)                3      14        7t/24 ≈ .29t      t/6 
interleave (4, 4, 4, 4)         4      16        t/5 ≈ .20t        t/4 
interleave (4, 4, 4, 3, 3)      5      16        11t/50 ≈ .22t     t/5 
comb                            5      31        31t/160 ≈ .19t    t/5 
comb (two-table)                4      30        15t/64 ≈ .23t     t/8 
interleave (5, 5, 5, 4, 4)      5      32        9t/50 ≈ .18t      t/5 
interleave (5, 5, 4, 4, 4, 4)   6      32        17t/90 ≈ .19t     t/6 

Table 3.7. Approximate operation counts in comb and interleaving methods for computing kP, 
P known in advance. The bitlength of k is denoted by t. The interleaving methods list the widths 
used on each row in calculating the NAF. 

3.4 Koblitz curves 

Koblitz curves, also known as anomalous binary curves, are elliptic curves defined over 
F_2. The primary advantage of these curves is that point multiplication algorithms can be 
devised that do not use any point doublings. The material in this section is closely based 
on the detailed paper by Solinas [446], which contains proofs of the facts and analyses of 
the algorithms presented. 


Definition 3.53 Koblitz curves are the following elliptic curves defined over F_2: 

   E_0 : y^2 + xy = x^3 + 1 
   E_1 : y^2 + xy = x^3 + x^2 + 1. 





In cryptographic protocols, one uses the group E_0(F_{2^m}) or E_1(F_{2^m}) of F_{2^m}-rational 
points for some extension field F_{2^m}. Let a ∈ {0, 1}. For each proper divisor l of m, 
E_a(F_{2^l}) is a subgroup of E_a(F_{2^m}) and hence #E_a(F_{2^l}) divides #E_a(F_{2^m}). In 
particular, since #E_0(F_2) = 4 and #E_1(F_2) = 2, #E_0(F_{2^m}) is a multiple of 4 and 
#E_1(F_{2^m}) is a multiple of 2. 


Definition 3.54 A Koblitz curve E_a has almost-prime group order over F_{2^m} if 
#E_a(F_{2^m}) = hn where n is prime and 

   h = 4 if a = 0,   h = 2 if a = 1. 

h is called the cofactor. 


We shall assume throughout the remainder of this section that E_a is a Koblitz curve 
with almost-prime group order #E_a(F_{2^m}). Observe that #E_a(F_{2^m}) can only be almost 
prime if m is a prime number. The group orders #E_a(F_{2^m}) can be efficiently computed 
using Theorem 3.11. Table 3.8 lists the extension degrees m ∈ [100, 600], and Koblitz 
curves E_a for which #E_a(F_{2^m}) is almost prime. 


  m   curve   prime factorization of #E_a(F_{2^m}) 
101   E_1     2 · 1267650600228230886142808508011 
103   E_0     4 · 2535301200456459535862530067069 
107   E_0     4 · 40564819207303335604363489037809 
107   E_1     2 · 81129638414606692182851032212511 
109   E_1     2 · 324518553658426701487448656461467 
113   E_1     2 · 5192296858534827627896703833467507 
131   E_0     4 · 680564733841876926932320129493409985129 
163   E_1     2 · 5846006549323611672814741753598448348329118574063 
233   E_0     4 · 3450873173395281893717377931138512760570940988862252126328087024741343 
239   E_0     4 · 220855883097298041197912187592864814948216561321709848887480219215362213 
277   E_0     4 · 60708402882054033466233184588234965832575110498786508764884175561891622165064650683 
283   E_0     4 · 3885337784451458141838923813647037813284811733793061324295874997529815829704422603873 
283   E_1     2 · 7770675568902916283677847627294075626569631244830993521422749282851602622232822777663 
311   E_1     2 · 2085924839766513752338888384931203236916703635071711166739891218584916354726654294825338302183 
331   E_1     2 · 2187250724783011924372502227117621365353169430893227643447010306711358712586776588594343505255614303 
347   E_1     2 · 143343663499379469475676305956380433799785311823017565728537420307240763803325774115493723193900257029311 
349   E_0     4 · 286687326998758938951352611912760867599570623646035147884067443354153078762511899035960651549018775044323 
359   E_1     2 · 587135645693458306972370149197334256843920637227079966811081824609485917244124494882365172478748165648998663 
409   E_0     4 · 330527984395124299475957654016385519914202341482140609642324395022880711289249191050673258457777458014096366590617731358671 
571   E_0     4 · 1932268761508629172347675945465993672149463664853217499328617625725759571144780212268133978522706711834706712800825351461273674974066617311929682421617092503555733685276673 

Table 3.8. Koblitz curves E_a with almost-prime group order #E_a(F_{2^m}) and m ∈ [100, 600]. 


3.4.1 The Frobenius map and the ring Z[τ] 


Definition 3.55 Let E_a be a Koblitz curve. The Frobenius map τ : E_a(F_{2^m}) → 
E_a(F_{2^m}) is defined by 

   τ(∞) = ∞,   τ(x, y) = (x^2, y^2). 


The Frobenius map can be efficiently computed since squaring in F_{2^m} is relatively 
inexpensive (see §2.3.4). It is known that 

   (τ^2 + 2)P = μτ(P)   for all P ∈ E_a(F_{2^m}), 

where μ = (-1)^{1-a} and τ^l(P) denotes the l-fold application of τ to P. Hence the 
Frobenius map can be regarded as a complex number τ satisfying 

   τ^2 + 2 = μτ;                                                        (3.32) 

we choose τ = (μ + √-7)/2. Let Z[τ] denote the ring of polynomials in τ with integer 
coefficients. It now makes sense to multiply points in E_a(F_{2^m}) by elements of the ring 
Z[τ]: if u_{l-1}τ^{l-1} + ··· + u_1τ + u_0 ∈ Z[τ] and P ∈ E_a(F_{2^m}), then 

   (u_{l-1}τ^{l-1} + ··· + u_1τ + u_0)P = u_{l-1}τ^{l-1}(P) + ··· + u_1τ(P) + u_0P.      (3.33) 


The strategy for developing an efficient point multiplication algorithm for Koblitz 
curves is to find, for a given integer k, a “nice” expression of the form k = Σ_{i=0}^{l-1} u_i τ^i, 
and then use (3.33) to compute kP. Here, “nice” means that l is relatively small and 
the nonzero digits u_i are small (e.g., ±1) and sparse. 

Since τ^2 = μτ - 2, every element α in Z[τ] can be expressed in canonical form 

   α = a_0 + a_1 τ   where a_0, a_1 ∈ Z. 


Definition 3.56 The norm of α = a_0 + a_1 τ ∈ Z[τ] is the (integer) product of α and its 
complex conjugate. Explicitly, 

   N(a_0 + a_1 τ) = a_0^2 + μ a_0 a_1 + 2 a_1^2. 


Theorem 3.57 (properties of the norm function) 
(i) N(α) ≥ 0 for all α ∈ Z[τ] with equality if and only if α = 0. 
(ii) 1 and -1 are the only elements of Z[τ] having norm 1. 
(iii) N(τ) = 2 and N(τ - 1) = h. 
(iv) N(τ^m - 1) = #E_a(F_{2^m}) and N((τ^m - 1)/(τ - 1)) = n. 
(v) The norm function is multiplicative; that is, N(α_1 α_2) = N(α_1)N(α_2) for all 
α_1, α_2 ∈ Z[τ]. 
(vi) Z[τ] is a Euclidean domain with respect to the norm function. That is, for any 
α, β ∈ Z[τ] with β ≠ 0, there exist κ, ρ ∈ Z[τ] (not necessarily unique) such that 
α = κβ + ρ and N(ρ) < N(β). 
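Arithmetic in Z[τ] is easy to experiment with. The following Python sketch represents a_0 + a_1 τ as the pair (a_0, a_1) and reduces products with τ^2 = μτ - 2; the function names `mul` and `norm` are ours. It checks parts (iii) and (v) of Theorem 3.57 on small inputs and is meant only as an illustration.

```python
def mul(x, y, mu):
    """Product of x = (a0, a1) and y = (b0, b1) in Z[tau], using tau^2 = mu*tau - 2."""
    a0, a1 = x
    b0, b1 = y
    return (a0 * b0 - 2 * a1 * b1, a0 * b1 + a1 * b0 + mu * a1 * b1)


def norm(x, mu):
    """Norm N(a0 + a1*tau) = a0^2 + mu*a0*a1 + 2*a1^2."""
    a0, a1 = x
    return a0 * a0 + mu * a0 * a1 + 2 * a1 * a1


for a in (0, 1):
    mu = (-1) ** (1 - a)
    h = 4 if a == 0 else 2
    tau = (0, 1)
    tau_minus_1 = (-1, 1)
    print(a, norm(tau, mu), norm(tau_minus_1, mu) == h)

# multiplicativity on one sample pair (mu = 1)
x, y = (3, -2), (5, 7)
print(norm(mul(x, y, 1), 1) == norm(x, 1) * norm(y, 1))
```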


τ-adic non-adjacent form (TNAF) 


It follows from Theorem 3.57 that any positive integer k can be written in the form 
k = Σ_{i=0}^{l-1} u_i τ^i where each u_i ∈ {0, ±1}. Such a τ-adic representation can be obtained 
by repeatedly dividing k by τ; the digits u_i are the remainders of the division steps. This 
procedure is analogous to the derivation of the binary representation of k by repeated 
division by 2. In order to decrease the number of point additions in (3.33), it is desirable 
to obtain a τ-adic representation for k that has a small number of nonzero digits. This 
can be achieved by using the τ-adic NAF, which can be viewed as a τ-adic analogue of 
the ordinary NAF (Definition 3.28). 

Definition 3.58 A τ-adic NAF or TNAF of a nonzero element κ ∈ Z[τ] is an expression 
κ = Σ_{i=0}^{l-1} u_i τ^i where each u_i ∈ {0, ±1}, u_{l-1} ≠ 0, and no two consecutive digits u_i are 
nonzero. The length of the TNAF is l. 


Theorem 3.59 (properties of TNAFs) Let κ ∈ Z[τ], κ ≠ 0. 
(i) κ has a unique TNAF denoted TNAF(κ). 
(ii) If the length l(κ) of TNAF(κ) is greater than 30, then 

   log_2(N(κ)) - 0.55 < l(κ) < log_2(N(κ)) + 3.52. 

(iii) The average density of nonzero digits among all TNAFs of length l is 
approximately 1/3. 


TNAF(κ) can be efficiently computed using Algorithm 3.61, which can be viewed 
as a τ-adic analogue of Algorithm 3.30. The digits of TNAF(κ) are generated by 
repeatedly dividing κ by τ, allowing remainders of 0 or ±1. If κ is not divisible by τ, 
then the remainder r ∈ {-1, 1} is chosen so that the quotient (κ - r)/τ is divisible by 
τ, ensuring that the next TNAF digit is 0. Division of α ∈ Z[τ] by τ and τ^2 is easily 
accomplished using the following result. 


Theorem 3.60 (division by τ and τ^2 in Z[τ]) Let α = r_0 + r_1 τ ∈ Z[τ]. 

(i) α is divisible by τ if and only if r_0 is even. If r_0 is even, then 

   α/τ = (r_1 + μ r_0/2) - (r_0/2)τ. 

(ii) α is divisible by τ^2 if and only if r_0 ≡ 2r_1 (mod 4). 


Algorithm 3.61 Computing the TNAF of an element in Z[τ] 


INPUT: κ = r_0 + r_1 τ ∈ Z[τ]. 
OUTPUT: TNAF(κ). 
1. i ← 0. 
2. While r_0 ≠ 0 or r_1 ≠ 0 do 
   2.1 If r_0 is odd then: u_i ← 2 - (r_0 - 2r_1 mod 4), r_0 ← r_0 - u_i; 
   2.2 Else: u_i ← 0. 
   2.3 t ← r_0, r_0 ← r_1 + μ r_0/2, r_1 ← -t/2, i ← i + 1. 
3. Return(u_{i-1}, u_{i-2}, ..., u_1, u_0). 
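Algorithm 3.61 needs only a few integer operations per digit. The Python sketch below follows its steps; digits are returned least significant first, and the helper `evaluate` (our own addition, not part of the algorithm) recombines them in Z[τ] by Horner's rule so that the expansion can be checked.

```python
def tnaf(r0, r1, mu):
    """TNAF digits of r0 + r1*tau (Algorithm 3.61), least significant first."""
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2 == 1:
            u = 2 - ((r0 - 2 * r1) % 4)   # u is +1 or -1
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # divide by tau (Theorem 3.60(i)); r0 is even at this point
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits


def evaluate(digits, mu):
    """Recombine sum(u_i * tau^i) into canonical form (a0, a1)."""
    a0, a1 = 0, 0
    for u in reversed(digits):
        # multiply the accumulator by tau, then add the next digit
        a0, a1 = -2 * a1 + u, a0 + mu * a1
    return a0, a1


print(tnaf(9, 0, 1))                 # [1, 0, 0, -1, 0, 1], i.e. 9 = tau^5 - tau^3 + 1 when mu = 1
print(evaluate(tnaf(9, 0, 1), 1))    # (9, 0)
```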


To compute kP, one can find TNAF(k) using Algorithm 3.61 and then use (3.33). 
By Theorem 3.59(ii), the length of TNAF(k) is approximately log_2(N(k)) = 2 log_2 k, 
which is twice the length of NAF(k). To circumvent the problem of a long TNAF, 
notice that if γ ≡ k (mod τ^m - 1) then kP = γP for all P ∈ E_a(F_{2^m}). This follows 
because 

   (τ^m - 1)(P) = τ^m(P) - P = P - P = ∞. 

It can also be shown that if ρ ≡ k (mod δ) where δ = (τ^m - 1)/(τ - 1), then kP = 
ρP for all points P of order n in E_a(F_{2^m}). The strategy now is to find ρ ∈ Z[τ] of 
as small norm as possible with ρ ≡ k (mod δ), and then use TNAF(ρ) to compute 
ρP. Algorithm 3.62 finds, for any α, β ∈ Z[τ] with β ≠ 0, a quotient κ ∈ Z[τ] and 
a remainder ρ ∈ Z[τ] with α = κβ + ρ and N(ρ) as small as possible. It uses, as a 
subroutine, Algorithm 3.63 for finding an element of Z[τ] that is “close” to a given 
complex number λ_0 + λ_1 τ with λ_0, λ_1 ∈ Q. 


Algorithm 3.62 Division in Z[τ] 


INPUT: α = a_0 + a_1 τ ∈ Z[τ], β = b_0 + b_1 τ ∈ Z[τ] with β ≠ 0. 
OUTPUT: κ = q_0 + q_1 τ, ρ = r_0 + r_1 τ ∈ Z[τ] with α = κβ + ρ and N(ρ) ≤ (4/7)N(β). 
1. g_0 ← a_0 b_0 + μ a_0 b_1 + 2 a_1 b_1, 
2. g_1 ← a_1 b_0 - a_0 b_1. 
3. N ← b_0^2 + μ b_0 b_1 + 2 b_1^2. 
4. λ_0 ← g_0/N, λ_1 ← g_1/N. 
5. Use Algorithm 3.63 to compute (q_0, q_1) ← Round(λ_0, λ_1). 
6. r_0 ← a_0 - b_0 q_0 + 2 b_1 q_1, 
7. r_1 ← a_1 - b_1 q_0 - b_0 q_1 - μ b_1 q_1. 
8. κ ← q_0 + q_1 τ, 
9. ρ ← r_0 + r_1 τ. 
10. Return(κ, ρ). 


Algorithm 3.63 Rounding off in Z[τ] 


INPUT: Rational numbers λ_0 and λ_1. 
OUTPUT: Integers q_0, q_1 such that q_0 + q_1 τ is close to complex number λ_0 + λ_1 τ. 
1. For i from 0 to 1 do 
   1.1 f_i ← ⌊λ_i + 1/2⌋, η_i ← λ_i - f_i, h_i ← 0. 
2. η ← 2η_0 + μη_1. 
3. If η ≥ 1 then 
   3.1 If η_0 - 3μη_1 < -1 then h_1 ← μ; else h_0 ← 1. 
   Else 
   3.2 If η_0 + 4μη_1 ≥ 2 then h_1 ← μ. 
4. If η < -1 then 
   4.1 If η_0 - 3μη_1 ≥ 1 then h_1 ← -μ; else h_0 ← -1. 
   Else 
   4.2 If η_0 + 4μη_1 < -2 then h_1 ← -μ. 
5. q_0 ← f_0 + h_0, q_1 ← f_1 + h_1. 
6. Return(q_0, q_1). 
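Algorithms 3.62 and 3.63 can be prototyped with exact rational arithmetic. The Python sketch below uses `fractions.Fraction` for λ_0 and λ_1 and represents elements of Z[τ] as pairs; the function names `round_tau`, `divide`, `mul` and `norm` are ours, and the code is an illustration of the steps rather than an optimized implementation.

```python
from fractions import Fraction
from math import floor


def mul(x, y, mu):
    """Product in Z[tau], using tau^2 = mu*tau - 2."""
    return (x[0] * y[0] - 2 * x[1] * y[1],
            x[0] * y[1] + x[1] * y[0] + mu * x[1] * y[1])


def norm(x, mu):
    return x[0] * x[0] + mu * x[0] * x[1] + 2 * x[1] * x[1]


def round_tau(lam0, lam1, mu):
    """Algorithm 3.63: integers (q0, q1) with q0 + q1*tau close to lam0 + lam1*tau."""
    f0 = floor(lam0 + Fraction(1, 2))
    f1 = floor(lam1 + Fraction(1, 2))
    eta0, eta1 = lam0 - f0, lam1 - f1
    h0 = h1 = 0
    eta = 2 * eta0 + mu * eta1
    if eta >= 1:
        if eta0 - 3 * mu * eta1 < -1:
            h1 = mu
        else:
            h0 = 1
    elif eta0 + 4 * mu * eta1 >= 2:
        h1 = mu
    if eta < -1:
        if eta0 - 3 * mu * eta1 >= 1:
            h1 = -mu
        else:
            h0 = -1
    elif eta0 + 4 * mu * eta1 < -2:
        h1 = -mu
    return f0 + h0, f1 + h1


def divide(a, b, mu):
    """Algorithm 3.62: quotient and remainder of a divided by b in Z[tau]."""
    a0, a1 = a
    b0, b1 = b
    g0 = a0 * b0 + mu * a0 * b1 + 2 * a1 * b1
    g1 = a1 * b0 - a0 * b1
    n = norm(b, mu)
    q0, q1 = round_tau(Fraction(g0, n), Fraction(g1, n), mu)
    r0 = a0 - b0 * q0 + 2 * b1 * q1
    r1 = a1 - b1 * q0 - b0 * q1 - mu * b1 * q1
    return (q0, q1), (r0, r1)


kappa, rho = divide((100, 0), (3, 2), 1)
print(kappa, rho)        # (22, -9) (-2, 1)
```

In this example N(β) = 23 and N(ρ) = 4, so the remainder is well within the bound stated in the output of Algorithm 3.62.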








Definition 3.64 Let α, β ∈ Z[τ] with β ≠ 0. Then α mod β is defined to be the output 
ρ ∈ Z[τ] of Algorithm 3.62. 

Algorithm 3.62 for computing ρ = k mod δ is cumbersome to implement on some 
platforms because it requires two multiprecision integer divisions (in step 4). 
Algorithm 3.65 computes an element ρ' ≡ k (mod δ) without the expensive multiprecision 
integer divisions. We write ρ' = k partmod δ. Solinas proved that l(ρ) ≤ m + a and if 
C ≥ 2 then l(ρ') ≤ m + a + 3. However, it is still possible that l(ρ') is significantly 
bigger than l(ρ). This is not a concern in practice since the probability that ρ' ≠ ρ 
is less than 2^{-(C-5)}; hence selection of a sufficiently large C ensures ρ' = ρ with 
overwhelming probability. 


Algorithm 3.65 Partial reduction modulo δ = (τ^m - 1)/(τ - 1) 
INPUT: k ∈ [1, n - 1], C ≥ 2, s_0 = d_0 + μd_1, s_1 = -d_1, where δ = d_0 + d_1 τ. 
OUTPUT: ρ' = k partmod δ. 
1. k' ← ⌊k/2^{a-C+(m-9)/2}⌋. 
2. V_m ← 2^m + 1 - #E_a(F_{2^m}). 
3. For i from 0 to 1 do 
   3.1 g' ← s_i · k'. j' ← V_m · ⌊g'/2^m⌋. 
   3.2 λ_i ← ⌊(g' + j')/2^{(m+5)/2} + 1/2⌋ / 2^C. 
4. Use Algorithm 3.63 to compute (q_0, q_1) ← Round(λ_0, λ_1). 
5. r_0 ← k - (s_0 + μs_1)q_0 - 2s_1 q_1, r_1 ← s_1 q_0 - s_0 q_1. 
6. Return(r_0 + r_1 τ). 


3.4.2 Point multiplication 


Algorithm 3.66 is an efficient point multiplication method that incorporates the ideas 
of the preceding subsection. Since the length of TNAF(ρ') is approximately m, and 
since its density is expected to be about 1/3, Algorithm 3.66 has an expected running 
time of approximately 

   (m/3) A.                                                             (3.34) 


Algorithm 3.66 TNAF method for point multiplication on Koblitz curves 


INPUT: Integer k ∈ [1, n - 1], P ∈ E(F_{2^m}) of order n. 
OUTPUT: kP. 
1. Use Algorithm 3.65 to compute ρ' = k partmod δ. 
2. Use Algorithm 3.61 to compute TNAF(ρ') = Σ_{i=0}^{l-1} u_i τ^i. 
3. Q ← ∞. 
4. For i from l - 1 downto 0 do 
   4.1 Q ← τQ. 
   4.2 If u_i = 1 then Q ← Q + P. 
   4.3 If u_i = -1 then Q ← Q - P. 
5. Return(Q). 


Window methods 


If some extra memory is available, the running time of Algorithm 3.66 can be decreased 
by deploying a window method which processes w digits of ρ' at a time. This 
is achieved by using a width-w TNAF, which can be viewed as a τ-adic analogue of 
the ordinary width-w NAF (Definition 3.32). 


Theorem 3.67 Let {U_k} be the integer sequence defined by U_0 = 0, U_1 = 1, U_{k+1} = 
μU_k - 2U_{k-1} for k ≥ 1. 

(i) U_k^2 - μU_{k-1}U_k + 2U_{k-1}^2 = 2^{k-1} for all k ≥ 1. 

(ii) Let t_k = 2U_{k-1}U_k^{-1} mod 2^k for k ≥ 1. (Since U_k is odd for each k ≥ 1, U_k^{-1} mod 
2^k does indeed exist.) Then t_k^2 + 2 ≡ μt_k (mod 2^k) for all k ≥ 1. 
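Both parts of Theorem 3.67 are easy to check numerically for small k. The Python sketch below builds the sequence U_k and the residues t_k; the function names `lucas_U` and `t_w` are ours, and the modular inverse uses Python's built-in three-argument `pow` with exponent -1 (available from Python 3.8).

```python
def lucas_U(mu, kmax):
    """Return [U_0, U_1, ..., U_kmax] with U_{k+1} = mu*U_k - 2*U_{k-1}."""
    U = [0, 1]
    for _ in range(2, kmax + 1):
        U.append(mu * U[-1] - 2 * U[-2])
    return U


def t_w(mu, w):
    """t_w = 2 * U_{w-1} * U_w^{-1} mod 2^w (Theorem 3.67(ii))."""
    U = lucas_U(mu, w)
    mod = 2 ** w
    return (2 * U[w - 1] * pow(U[w] % mod, -1, mod)) % mod


for mu in (1, -1):
    U = lucas_U(mu, 8)
    for k in range(1, 9):
        part_i = U[k] ** 2 - mu * U[k - 1] * U[k] + 2 * U[k - 1] ** 2 == 2 ** (k - 1)
        t = t_w(mu, k)
        part_ii = (t * t + 2 - mu * t) % 2 ** k == 0
        print(mu, k, t, part_i, part_ii)
```

The value t_w is the image of τ under the homomorphism used by the width-w TNAF algorithm, so it is computed once per window width.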


From (3.32) and Theorem 3.67(ii), it follows that the map φ_w : Z[τ] → Z_{2^w} induced 
by τ ↦ t_w is a surjective ring homomorphism with kernel {α ∈ Z[τ] : τ^w divides α}. 
Moreover, a set of distinct representatives of the equivalence classes of Z[τ] modulo 
τ^w is {0, ±1, ±2, ±3, ..., ±(2^{w-1} - 1), -2^{w-1}}, of which {±1, ±3, ..., ±(2^{w-1} - 1)} 
are not divisible by τ. 


Definition 3.68 Let w ≥ 2 be a positive integer. Define α_i = i mod τ^w for i ∈ 
{1, 3, 5, ..., 2^{w-1} - 1}. A width-w TNAF of a nonzero element κ ∈ Z[τ] is an expression 
κ = Σ_{i=0}^{l-1} u_i τ^i where each u_i ∈ {0, ±α_1, ±α_3, ..., ±α_{2^{w-1}-1}}, u_{l-1} ≠ 0, and at 
most one of any w consecutive digits is nonzero. The length of the width-w TNAF is l. 


Note that TNAF_2(κ) = TNAF(κ). Tables 3.9 and 3.10 list the α_u's for a ∈ {0, 1} 
and 3 ≤ w ≤ 6. The expression given for each α_u has at most two terms that involve 
powers of τ and other α_u's. TNAF(α_u) = (u_{l-1}, ..., u_1, u_0) is understood to mean 
Σ_{i=0}^{l-1} u_i τ^i. Most of the entries in the last columns of the tables were obtained from 
the TNAF; a few exceptions were made where use of the TNAF is less efficient. With 
these expressions, each α_u P can be computed using at most one elliptic curve addition 
operation. 

TNAF_w(ρ) can be efficiently computed using Algorithm 3.69. In Algorithm 3.69, 
k mods 2^w denotes the integer u satisfying u ≡ k (mod 2^w) and -2^{w-1} ≤ u < 2^{w-1}. 
The digits of TNAF_w(ρ) are obtained by repeatedly dividing ρ by τ, allowing remainders 
γ in {0, ±α_1, ±α_3, ..., ±α_{2^{w-1}-1}}. If ρ is not divisible by τ and the remainder 
chosen is α_u where u = φ_w(ρ) mods 2^w, then (ρ - α_u)/τ will be divisible by τ^{w-1}, 
ensuring that the next w - 1 digits are 0. 

[Table 3.9: table body not recoverable.] 

Table 3.9. Expressions for α_u = u mod τ^w for a = 0 and 3 ≤ w ≤ 6. 

[Table 3.10: table body not recoverable.] 

Table 3.10. Expressions for α_u = u mod τ^w for a = 1 and 3 ≤ w ≤ 6. 

Algorithm 3.69 Computing a width-w TNAF of an element in Z[τ] 
INPUT: w, t_w, α_u = β_u + γ_u τ for u ∈ {1, 3, 5, ..., 2^{w-1} - 1}, ρ = r_0 + r_1 τ ∈ Z[τ]. 
OUTPUT: TNAF_w(ρ). 
1. i ← 0. 
2. While r_0 ≠ 0 or r_1 ≠ 0 do 
   2.1 If r_0 is odd then 
          u ← r_0 + r_1 t_w mods 2^w. 
          If u > 0 then s ← 1; else s ← -1, u ← -u. 
          r_0 ← r_0 - sβ_u, r_1 ← r_1 - sγ_u, u_i ← sα_u. 
   2.2 Else: u_i ← 0. 
   2.3 t ← r_0, r_0 ← r_1 + μ r_0/2, r_1 ← -t/2, i ← i + 1. 
3. Return(u_{i-1}, u_{i-2}, ..., u_1, u_0). 

Algorithm 3.70 is an efficient point multiplication algorithm that uses the width-w 
TNAF. Since the expected length of TNAF(ρ') is m, and since its density is expected 
to be about 1/(w + 1), Algorithm 3.70 has an expected running time of approximately 

   ( 2^{w-2} - 1 + m/(w + 1) ) A.                                       (3.35) 


Algorithm 3.70 Window TNAF point multiplication method for Koblitz curves 


INPUT: Window width w, integer k ∈ [1, n - 1], P ∈ E(F_{2^m}) of order n. 
OUTPUT: kP. 
1. Use Algorithm 3.65 to compute ρ' = k partmod δ. 
2. Use Algorithm 3.69 to compute TNAF_w(ρ') = Σ_{i=0}^{l-1} u_i τ^i. 
3. Compute P_u = α_u P, for u ∈ {1, 3, 5, ..., 2^{w-1} - 1}. 
4. Q ← ∞. 
5. For i from l - 1 downto 0 do 
   5.1 Q ← τQ. 
   5.2 If u_i ≠ 0 then: 
          Let u be such that α_u = u_i or α_{-u} = -u_i. 
          If u > 0 then Q ← Q + P_u; 
          Else Q ← Q - P_{-u}. 
6. Return(Q). 


3.5 Curves with efficiently computable endomorphisms 


The Frobenius map (Definition 3.55) is an example of an endomorphism of an elliptic 
curve. This section presents a general technique for accelerating point multiplication on 
elliptic curves that have efficiently computable endomorphisms. While the technique 
does not yield a speedup that is as dramatic as achieved in §3.4 for Koblitz curves 
(where all the point doublings are replaced by much faster applications of the Frobenius 
map), it can be used to accelerate point multiplication on a larger class of curves 
including some elliptic curves over large prime fields. Roughly speaking, if the 
endomorphism can be computed in no more time than it takes to perform a small number of 
point doublings, then the technique eliminates about half of all doublings and reduces 
the point multiplication time by roughly 33%. 


Endomorphisms of elliptic curves 


Let E be an elliptic curve defined over a field K. The set of all points on E whose 
coordinates lie in any finite extension of K is also denoted by E. An endomorphism φ 
of E over K is a map φ : E → E such that φ(∞) = ∞ and φ(P) = (g(P), h(P)) for 
all P ∈ E, where g and h are rational functions whose coefficients lie in K. The set of 
all endomorphisms of E over K forms a ring, called the endomorphism ring of E over 
K. An endomorphism φ is also a group homomorphism, that is, 

   φ(P_1 + P_2) = φ(P_1) + φ(P_2) for all P_1, P_2 ∈ E. 

The characteristic polynomial of an endomorphism φ is the monic polynomial f(X) 
of least degree in Z[X] such that f(φ) = 0, that is, f(φ)(P) = ∞ for all P ∈ E. If E is 
a non-supersingular elliptic curve, then the characteristic polynomial of φ has degree 1 
or 2. 


Example 3.71 (endomorphisms of elliptic curves) 

(i) Let E be an elliptic curve defined over F_q. For each integer m, the multiplication 
by m map [m] : E → E defined by 

   [m] : P ↦ mP 

is an endomorphism of E defined over F_q. A special case is the negation map 
defined by P ↦ -P. The characteristic polynomial of [m] is X - m. 

(ii) Let E be an elliptic curve defined over F_q. Then the q-th power map φ : E → E 
defined by 

   φ : (x, y) ↦ (x^q, y^q),   φ : ∞ ↦ ∞ 

is an endomorphism of E defined over F_q, called the Frobenius endomorphism. 
The characteristic polynomial of φ is X^2 - tX + q, where t = q + 1 - #E(F_q). 

(iii) Let p ≡ 1 (mod 4) be a prime, and consider the elliptic curve 

   E : y^2 = x^3 + ax 

defined over F_p. Let i ∈ F_p be an element of order 4. Then the map φ : E → E 
defined by 

   φ : (x, y) ↦ (-x, iy),   φ : ∞ ↦ ∞ 

is an endomorphism of E defined over F_p. Note that φ(P) can be computed 
using only one multiplication. The characteristic polynomial of φ is X^2 + 1. 

(iv) Let p ≡ 1 (mod 3) be a prime, and consider the elliptic curve 

   E : y^2 = x^3 + b 

defined over F_p. Let β ∈ F_p be an element of order 3. Then the map φ : E → E 
defined by 

   φ : (x, y) ↦ (βx, y),   φ : ∞ ↦ ∞ 

is an endomorphism of E defined over F_p. Note that φ(P) can be computed 
using only one multiplication. The characteristic polynomial of φ is X^2 + X + 1. 
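Part (iv) of the example can be checked by brute force on a toy curve. The Python sketch below uses parameters chosen by us purely for illustration (they do not come from the text): the curve y^2 = x^3 + 3 over F_7, which has 13 points, and β = 2, an element of order 3 in F_7. It verifies that φ(x, y) = (βx, y) maps curve points to curve points and acts as multiplication by 3, a root of X^2 + X + 1 modulo 13.

```python
P_MOD = 7      # toy prime with p = 1 (mod 3); an assumption for illustration
B = 3          # curve y^2 = x^3 + 3 over F_7
BETA = 2       # element of order 3 in F_7 (2^3 = 8 = 1 mod 7)


def add(P, Q):
    """Affine point addition on y^2 = x^3 + B; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        s = 3 * x1 * x1 * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        s = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    s %= P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def mul(k, P):
    """Left-to-right double-and-add."""
    Q = None
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


def phi(P):
    """The endomorphism (x, y) -> (beta*x, y)."""
    return None if P is None else ((BETA * P[0]) % P_MOD, P[1])


points = [(x, y) for x in range(P_MOD) for y in range(P_MOD)
          if (y * y - x ** 3 - B) % P_MOD == 0]
print(len(points) + 1)                              # group order, including infinity
print(all(phi(P) in points for P in points))        # phi maps E to E
print(all(phi(P) == mul(3, P) for P in points))     # phi acts as [3] on this group
```

Because the group order 13 is prime here, the whole group plays the role of the order-n subgroup, and a single field multiplication by β replaces the scalar multiplication by 3.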


Note 3.72 (integer representation of an endomorphism) Suppose now that E is an 
elliptic curve defined over the finite field F_q. Suppose also that #E(F_q) is divisible by 
a prime n, and that n^2 does not divide #E(F_q). Then E(F_q) contains exactly one 
subgroup of order n; let this subgroup be ⟨P⟩ where P ∈ E(F_q) has order n. If φ is 
an endomorphism of E defined over F_q, then φ(P) ∈ E(F_q) and hence φ(P) ∈ ⟨P⟩. 
Suppose that φ(P) ≠ ∞. Then we can write 

   φ(P) = λP for some λ ∈ [1, n - 1]. 

In fact λ is a root modulo n of the characteristic polynomial of φ. 


Example 3.73 (the elliptic curve P-160) Consider the elliptic curve 

   E : y^2 = x^3 + 3 

defined over the 160-bit prime field F_p, where 

   p = 2^160 - 229233 
     = 1461501637330902918203684832716283019655932313743. 

Since p ≡ 1 (mod 3), the curve is of the type described in Example 3.71(iv). The group 
of F_p-rational points on E has prime order 

   #E(F_p) = n = 1461501637330902918203687013445034429194588307251. 

An element of order 3 in F_p is 

   β = 771473166210819779552257112796337671037538143582 

and so the map φ : E → E defined by φ : ∞ ↦ ∞ and φ : (x, y) ↦ (βx, y) is an 
endomorphism of E defined over F_p. The solution 

   λ = 903860042511079968555273866340564498116022318806 

to the equation λ^2 + λ + 1 ≡ 0 (mod n) has the property that φ(P) = λP for all P ∈ 
E(F_p). 

Accelerating point multiplication 

The strategy for computing kP, where k ∈ [0, n - 1], is the following. First write 

   k = k_1 + k_2 λ mod n                                                (3.36) 

where the integers k_1 and k_2 are of approximately half the bitlength of k. Such an 
expression is called a balanced length-two representation of k. Since 

   kP = k_1 P + k_2 λP 
      = k_1 P + k_2 φ(P),                                               (3.37) 

kP can be obtained by first computing φ(P) and then using simultaneous multiple 
point multiplication (Algorithm 3.48) or interleaving (Algorithm 3.51) to evaluate 
(3.37). Since k_1 and k_2 are of half the bitlength of k, half of the point doublings are 
eliminated. The strategy is effective provided that a decomposition (3.36) and φ(P) 
can be computed efficiently. 


Decomposing a multiplier 


We describe one method for obtaining a balanced length-two representation of the 
multiplier k. For a vector v = (a, b) ∈ Z × Z, define 

   f(v) = a + bλ mod n. 

The idea is to first find two vectors, v_1 = (a_1, b_1) and v_2 = (a_2, b_2) in Z × Z such that 
1. v_1 and v_2 are linearly independent over R; 
2. f(v_1) = f(v_2) = 0; and 
3. v_1 and v_2 have small Euclidean norm (i.e., ||v_1|| = √(a_1^2 + b_1^2) ≈ √n, and similarly 
for v_2). 

Then, by considering (k, 0) as a vector in Q × Q, we can use elementary linear algebra 
to write 

   (k, 0) = γ_1 v_1 + γ_2 v_2, where γ_1, γ_2 ∈ Q. 

If we let c_1 = ⌊γ_1⌉ and c_2 = ⌊γ_2⌉, where ⌊x⌉ denotes the integer closest to x, then 
v = c_1 v_1 + c_2 v_2 is an integer-valued vector close to (k, 0) such that f(v) = 0. Thus 
the vector u = (k, 0) - v has small norm and satisfies f(u) = k. It follows that the 
components k_1, k_2 of u are small in absolute value and satisfy k_1 + k_2 λ ≡ k (mod n). 

The independent short vectors v_1 and v_2 satisfying f(v_1) = f(v_2) = 0 can be found 
by applying the extended Euclidean algorithm (Algorithm 2.19) to n and λ. The algorithm 
produces a sequence of equations s_i n + t_i λ = r_i where s_0 = 1, t_0 = 0, r_0 = n, 
s_1 = 0, t_1 = 1, r_1 = λ. Furthermore, it is easy to show that the remainders r_i are strictly 
decreasing and non-negative, that |t_i| < |t_{i+1}| for i ≥ 0, and that |s_i| < |s_{i+1}| and 
r_{i-1}|t_i| + r_i|t_{i-1}| = n for i ≥ 1. Now, let l be the greatest index for which r_l ≥ √n. Then 
it can be easily verified that v_1 = (r_{l+1}, -t_{l+1}) satisfies f(v_1) = 0 and ||v_1|| ≤ √(2n), 
and that v_2 = (r_l, -t_l) (and also v_2 = (r_{l+2}, -t_{l+2})) is linearly independent of v_1 and 
satisfies f(v_2) = 0. Heuristically, we would expect v_2 to have small norm. Thus v_1 and 
v_2 satisfy conditions 1-3 above. For this choice of v_1, v_2, we have γ_1 = b_2 k/n and 
γ_2 = -b_1 k/n. The method for decomposing k is summarized in Algorithm 3.74. 


Algorithm 3.74 Balanced length-two representation of a multiplier 
INPUT: Integers n, λ, k ∈ [0, n - 1]. 
OUTPUT: Integers k_1, k_2 such that k = k_1 + k_2 λ mod n and |k_1|, |k_2| ≈ √n. 
1. Run the extended Euclidean algorithm (Algorithm 2.19) with inputs n and λ. The 
   algorithm produces a sequence of equations s_i n + t_i λ = r_i where s_0 = 1, t_0 = 0, 
   r_0 = n, s_1 = 0, t_1 = 1, r_1 = λ, and the remainders r_i are non-negative and 
   strictly decreasing. Let l be the greatest index for which r_l ≥ √n. 
2. Set (a_1, b_1) ← (r_{l+1}, -t_{l+1}). 
3. If (r_l^2 + t_l^2) ≤ (r_{l+2}^2 + t_{l+2}^2) then set (a_2, b_2) ← (r_l, -t_l); 
   Else set (a_2, b_2) ← (r_{l+2}, -t_{l+2}). 
4. Compute c_1 = ⌊b_2 k/n⌉ and c_2 = ⌊-b_1 k/n⌉. 
5. Compute k_1 = k - c_1 a_1 - c_2 a_2 and k_2 = -c_1 b_1 - c_2 b_2. 
6. Return(k_1, k_2). 
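The decomposition needs nothing beyond integer arithmetic. The Python sketch below follows the steps of Algorithm 3.74; the function name `decompose` is ours, only the t_i coefficients of the extended Euclidean algorithm are tracked (the s_i are not needed), and rounding to the nearest integer is done as ⌊x + 1/2⌋ with integer division. It assumes gcd(n, λ) = 1, as is the case when n is prime.

```python
def decompose(n, lam, k):
    """Balanced length-two representation: k = k1 + k2*lam (mod n)."""
    # Extended Euclid on (n, lam): maintain r_i and t_i with s_i*n + t_i*lam = r_i.
    r_prev, r = n, lam
    t_prev, t = 0, 1
    while r * r >= n:                 # stop once r = r_{l+1} < sqrt(n)
        q = r_prev // r
        r_prev, r = r, r_prev - q * r
        t_prev, t = t, t_prev - q * t
    # Here (r_prev, t_prev) = (r_l, t_l) and (r, t) = (r_{l+1}, t_{l+1}).
    q = r_prev // r
    r_next, t_next = r_prev - q * r, t_prev - q * t      # (r_{l+2}, t_{l+2})
    a1, b1 = r, -t
    if r_prev ** 2 + t_prev ** 2 <= r_next ** 2 + t_next ** 2:
        a2, b2 = r_prev, -t_prev
    else:
        a2, b2 = r_next, -t_next
    c1 = (2 * b2 * k + n) // (2 * n)          # nearest integer to b2*k/n
    c2 = (-2 * b1 * k + n) // (2 * n)         # nearest integer to -b1*k/n
    k1 = k - c1 * a1 - c2 * a2
    k2 = -c1 * b1 - c2 * b2
    return k1, k2


# Toy parameters (ours): n = 13 and lam = 3, a root of X^2 + X + 1 modulo 13.
k1, k2 = decompose(13, 3, 7)
print(k1, k2, (k1 + k2 * 3 - 7) % 13 == 0)    # 0 -2 True
```

Since v_1 and v_2 depend only on n and λ, a practical implementation would compute them once and reuse them for every multiplier k.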







Example 3.75 (balanced length-two representation of a multiplier k) Consider the 
elliptic curve P-160 defined in Example 3.73. In the notation of Algorithm 3.74 we 
have 

   (r_l, t_l)         = (2180728751409538655993509, -186029539167685199353061) 
   (r_{l+1}, t_{l+1}) = (788919430192407951782190, 602889891024722752429129) 
   (r_{l+2}, t_{l+2}) = (602889891024722752429129, -1391809321217130704211319) 
   (a_1, b_1)         = (788919430192407951782190, -602889891024722752429129) 
   (a_2, b_2)         = (602889891024722752429129, 1391809321217130704211319). 

Now, let 

   k = 965486288327218559097909069724275579360008398257. 

We obtain 

   c_1 = 919446671339517233512759,   c_2 = 398276613783683332374156 

and 

   k_1 = -98093723971803846754077,   k_2 = 381880690058693066485147. 

Example 3.76 (balanced representation for special parameters) The elliptic curve can 
be chosen so that the parameters k_1 and k_2 may be obtained with much less effort than 
that required by Algorithm 3.74. For example, consider the curve 

   E : y^2 = x^3 - 2 

over F_p, where p = 2^390 + 3 is prime and, as in Example 3.71(iv), satisfies p ≡ 1 
(mod 3). The group of F_p-rational points on E has order 

   #E(F_p) = 2^390 - 2^195 + 7 = 63n 

where n is prime. If 

   λ = (2^195 - 2)/3   and   β = 2^389 + 2^194 + 1, 

then β is an element of order 3 in F_p, λ satisfies λ^2 + λ + 1 ≡ 0 (mod n), and λ(x, y) = 
(βx, y) for all (x, y) in the order-n subgroup of E(F_p). 

Suppose now that P = (x, y) is in the order-n subgroup of E(F_p), and k ∈ [0, n - 1] 
is a multiplier. To find a balanced length-two representation of k, write k = 2^195 k_2' + k_1' 
for k_1' < 2^195. Then 

   kP = (2^195 k_2' + k_1')P = ((3λ + 2)k_2' + k_1')P = (2k_2' + k_1')P + 3k_2' λP 
      = k_1 (x, y) + k_2 (βx, y), 

where k_1 = 2k_2' + k_1' and k_2 = 3k_2'. 

The method splits a multiplier k < n of approximately 384 bits into k_1 and k_2 where 
each is approximately half the bitlength of k. Finally, note that the cost of calculating 
βx = (2^389 + 2^194 + 1)x is less than a field multiplication. 
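The algebraic identities behind Example 3.76 can be confirmed with Python's arbitrary-precision integers. The sketch below takes p, #E(F_p), λ and β as read from the example and checks only the arithmetic relations; it does not test primality or perform any curve arithmetic. The function name `split` is ours.

```python
p = 2 ** 390 + 3
order = 2 ** 390 - 2 ** 195 + 7          # #E(F_p) as stated in the example
n = order // 63
lam = (2 ** 195 - 2) // 3
beta = 2 ** 389 + 2 ** 194 + 1


def split(k):
    """Write k = 2^195 * k2' + k1' and return (k1, k2) = (2*k2' + k1', 3*k2')."""
    k2p, k1p = divmod(k, 2 ** 195)
    return 2 * k2p + k1p, 3 * k2p


print(order % 63 == 0)                           # 63 divides the group order
print((lam * lam + lam + 1) % n == 0)            # lam is a root of X^2 + X + 1 mod n
print(pow(beta, 3, p) == 1 and beta % p != 1)    # beta has order 3 modulo p

k = n - 1
k1, k2 = split(k)
print(k1 + k2 * lam == k)                        # exact, because 3*lam + 2 = 2^195
print(k.bit_length(), k1.bit_length(), k2.bit_length())
```

The last line shows that both halves are roughly half as long as k, without any rounding or lattice computation.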


Point multiplication algorithm 


Given an elliptic curve E defined over a finite field F_q with a suitable endomorphism 
φ, Algorithm 3.77 calculates the point multiplication kP using the decomposition 
k = k_1 + k_2 λ mod n and interleaving k_1 P + k_2 φ(P). The expected running time is 
approximately 

   [ |{j : w_j > 2}| D + Σ_{j=1}^{2} (2^{w_j-2} - 1) A ] + C_k + C_φ 
      + [ D + Σ_{j=1}^{2} (1/(w_j + 1)) A ] t/2                          (3.38) 

where t is the bitlength of n, k_j is written with a width-w_j NAF, C_k denotes the cost of 
the decomposition of k, and C_φ is the cost of finding φ(P). The storage requirement is 
2^{w_1-2} + 2^{w_2-2} points. 

Since v_1 and v_2 do not depend on k, it is possible to precompute estimates for b_2/n 
and -b_1/n for use in step 4 of Algorithm 3.74. In this case, only steps 4-6 of 
Algorithm 3.74 must be performed, and hence the cost C_k is insignificant in the overall point 
multiplication. 


Algorithm 3.77 Point multiplication with efficiently computable endomorphisms 


INPUT: Integer k ∈ [1, n - 1], P ∈ E(F_q), window widths w_1 and w_2, and λ. 
OUTPUT: kP. 
1. Use Algorithm 3.74 to find k_1 and k_2 such that k = k_1 + k_2 λ mod n. 
2. Calculate P_2 = φ(P), and let P_1 = P. 
3. Use Algorithm 3.30 to compute NAF_{w_j}(|k_j|) = Σ_{i=0}^{l_j-1} k_{j,i} 2^i for j = 1, 2. 
4. Let l = max{l_1, l_2} and define k_{j,i} = 0 for l_j ≤ i < l, 1 ≤ j ≤ 2. 
5. If k_j < 0, then set k_{j,i} ← -k_{j,i} for 0 ≤ i < l_j, 1 ≤ j ≤ 2. 
6. Compute iP_j for i ∈ {1, 3, ..., 2^{w_j-1} - 1}, 1 ≤ j ≤ 2. 
7. Q ← ∞. 
8. For i from l - 1 downto 0 do 
   8.1 Q ← 2Q. 
   8.2 For j from 1 to 2 do 
          If k_{j,i} ≠ 0 then 
             If k_{j,i} > 0 then Q ← Q + k_{j,i} P_j; 
             Else Q ← Q - |k_{j,i}| P_j. 
9. Return(Q). 


3.6 Point multiplication using halving 


Point multiplication methods based on point halving share strategy with $\tau$-adic methods
on Koblitz curves (§3.4) in the sense that point doubling is replaced by a potentially
faster operation. As with the efficiently computable endomorphisms in §3.5, the
improvement is not as dramatic as that obtained with methods for Koblitz curves, although
halving applies to a wider class of curves.

Point halving was proposed independently by E. Knudsen and R. Schroeppel. We
restrict our attention to elliptic curves $E$ over binary fields $\mathbb{F}_{2^m}$ defined by the equation

\[ y^2 + xy = x^3 + ax^2 + b \]

where $a, b \in \mathbb{F}_{2^m}$, $b \neq 0$. To simplify the exposition, we assume that $\mathrm{Tr}(a) = 1$ (cf.
Theorem 3.18).² We further assume that $m$ is prime and that the reduction polynomials
are trinomials or pentanomials. These properties are satisfied by the five random curves
over binary fields recommended by NIST in the FIPS 186-2 standard (see §A.2.2).

²The algorithms presented in this section can be modified for binary curves with $\mathrm{Tr}(a) = 0$; however,
they are more complicated than the case where $\mathrm{Tr}(a) = 1$.

Let $P = (x, y)$ be a point on $E$ with $P \neq -P$. From §3.1.2, the (affine) coordinates
of $Q = 2P = (u, v)$ can be computed as follows:

\[ \lambda = x + y/x \tag{3.39} \]
\[ u = \lambda^2 + \lambda + a \tag{3.40} \]
\[ v = x^2 + u(\lambda + 1). \tag{3.41} \]

Affine point doubling requires one field multiplication and one field division. With
projective coordinates and $a \in \{0, 1\}$, point doubling can be done in four field
multiplications. Point halving is the following operation: given $Q = (u, v)$, compute $P = (x, y)$
such that $Q = 2P$. Since halving is the reverse operation of doubling, the basic idea for
halving is to solve (3.40) for $\lambda$, (3.41) for $x$, and finally (3.39) for $y$.

When G is a subgroup of odd order n in E, point doubling and point halving are 
automorphisms of G. Therefore, given a point Q € G, one can always find a unique 
point P € G such that Q = 2P. §3.6.1 and §3.6.2 describe an efficient algorithm for 
point halving in G. In §3.6.3, point halving is used to obtain efficient halve-and-add 
methods for point multiplication in cryptographic schemes based on elliptic curves 
over binary fields. 


3.6.1 Point halving 


The notion of trace plays a central role in deriving an efficient algorithm for point 
halving. 


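Definition 3.78 The trace function on $\mathbb{F}_{2^m}$ is the function $\mathrm{Tr} : \mathbb{F}_{2^m} \to \mathbb{F}_{2^m}$ defined by
$\mathrm{Tr}(c) = c + c^2 + c^{2^2} + \cdots + c^{2^{m-1}}$.


Lemma 3.79 (properties of the trace function) Let $c, d \in \mathbb{F}_{2^m}$.
(i) $\mathrm{Tr}(c) = \mathrm{Tr}(c^2) = \mathrm{Tr}(c)^2$; in particular, $\mathrm{Tr}(c) \in \{0, 1\}$.
(ii) Trace is linear; that is, $\mathrm{Tr}(c + d) = \mathrm{Tr}(c) + \mathrm{Tr}(d)$.
(iii) If $(u, v) \in G$, then $\mathrm{Tr}(u) = \mathrm{Tr}(a)$.


Property (iii) follows from (3.40) because
$\mathrm{Tr}(u) = \mathrm{Tr}(\lambda^2 + \lambda + a) = \mathrm{Tr}(\lambda)^2 + \mathrm{Tr}(\lambda) + \mathrm{Tr}(a) = \mathrm{Tr}(a)$.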


Given $Q = (u, v) \in G$, point halving seeks the unique point $P = (x, y) \in G$ such that
$Q = 2P$. The first step of halving is to find $\lambda = x + y/x$ by solving the equation

\[ \hat{\lambda}^2 + \hat{\lambda} = u + a \tag{3.42} \]

for $\hat{\lambda}$. An efficient algorithm for solving (3.42) is presented in §3.6.2. Let $\hat{\lambda}$ denote the
solution of (3.42) obtained from this algorithm. It is easily verified that $\lambda \in \{\hat{\lambda}, \hat{\lambda} + 1\}$.
If $\mathrm{Tr}(a) = 1$, the following result can be used to identify $\lambda$.


Theorem 3.80 Let $P = (x, y)$, $Q = (u, v) \in G$ be such that $Q = 2P$, and denote
$\lambda = x + y/x$. Let $\hat{\lambda}$ be a solution to (3.42), and $t = v + u\hat{\lambda}$. Suppose that $\mathrm{Tr}(a) = 1$. Then
$\hat{\lambda} = \lambda$ if and only if $\mathrm{Tr}(t) = 0$.

Proof: Recall from (3.41) that $x^2 = v + u(\lambda + 1)$. By Lemma 3.79(iii), we get $\mathrm{Tr}(x) = \mathrm{Tr}(a)$
since $P = (x, y) \in G$. Thus,

\[ \mathrm{Tr}(v + u(\lambda + 1)) = \mathrm{Tr}(x^2) = \mathrm{Tr}(x) = \mathrm{Tr}(a) = 1. \]

Hence, if $\hat{\lambda} = \lambda + 1$, then $\mathrm{Tr}(t) = \mathrm{Tr}(v + u(\lambda + 1)) = 1$ as required. Otherwise, we must
have $\hat{\lambda} = \lambda$, which gives $\mathrm{Tr}(t) = \mathrm{Tr}(v + u\lambda) = \mathrm{Tr}(v + u((\lambda + 1) + 1))$. Since the trace
function is linear,

\[ \mathrm{Tr}(v + u((\lambda + 1) + 1)) = \mathrm{Tr}(v + u(\lambda + 1)) + \mathrm{Tr}(u) = 1 + \mathrm{Tr}(u) = 0. \]

Hence, we conclude that $\hat{\lambda} = \lambda$ if and only if $\mathrm{Tr}(t) = 0$.

Theorem 3.80 suggests a simple algorithm for identifying $\lambda$ in the case that $\mathrm{Tr}(a) = 1$.
We can then solve $x^2 = v + u(\lambda + 1)$ for the unique root $x$. §3.6.2 presents efficient
algorithms for finding traces and square roots in $\mathbb{F}_{2^m}$. Finally, if needed, $y = \lambda x + x^2$
may be recovered with one field multiplication.
Let the $\lambda$-representation of a point $Q = (u, v)$ be $(u, \lambda_Q)$, where

\[ \lambda_Q = u + \frac{v}{u}. \]

Given the $\lambda$-representation of $Q$ as the input to point halving, we may compute $t$ in
Theorem 3.80 without converting to affine coordinates since

\[ t = v + u\hat{\lambda} = u\Big(u + u + \frac{v}{u}\Big) + u\hat{\lambda} = u(u + \lambda_Q + \hat{\lambda}). \]

In point multiplication, repeated halvings may be performed directly on the
$\lambda$-representation of a point, with conversion to affine only when a point addition is
required.


Algorithm 3.81 Point halving 


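INPUT: $\lambda$-representation $(u, \lambda_Q)$ or affine representation $(u, v)$ of $Q \in G$.
OUTPUT: $\lambda$-representation $(x, \lambda_P)$ of $P = (x, y) \in G$, where $Q = 2P$.
1. Find a solution $\hat{\lambda}$ of $\hat{\lambda}^2 + \hat{\lambda} = u + a$.
2. If the input is in $\lambda$-representation, then compute $t = u(u + \lambda_Q + \hat{\lambda})$;
   else, compute $t = v + u\hat{\lambda}$.
3. If $\mathrm{Tr}(t) = 0$, then $\lambda_P \leftarrow \hat{\lambda}$, $x \leftarrow \sqrt{t + u}$;
   else $\lambda_P \leftarrow \hat{\lambda} + 1$, $x \leftarrow \sqrt{t}$.
4. Return $(x, \lambda_P)$.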
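The halving procedure can be exercised end to end on a toy curve. The sketch below is only an illustration under stated assumptions: the field $\mathbb{F}_{2^7}$ with $f(z) = z^7 + z + 1$, the curve parameters $a = b = 1$, and all function names are choices made here, and the trace, half-trace, and square root are computed naively from their definitions rather than by the efficient methods of §3.6.2.

```python
# Toy parameters (assumptions for illustration): F_{2^7} with f(z) = z^7 + z + 1,
# and the curve y^2 + xy = x^3 + a*x^2 + b with a = b = 1, so Tr(a) = 1.
M, F, A, B = 7, 0b10000011, 1, 1

def mul(x, y):
    """Multiply field elements; bit i of an int is the coefficient of z^i."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= F
    return r

def inv(c):
    """Inverse as c^(2^M - 2) by square-and-multiply."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = mul(r, c)
        c = mul(c, c)
        e >>= 1
    return r

def trace(c):
    """Tr(c) = c + c^2 + ... + c^(2^(M-1)); the result is 0 or 1."""
    t = 0
    for _ in range(M):
        t ^= c
        c = mul(c, c)
    return t

def half_trace(c):
    """H(c) = sum of c^(2^(2i)) for 0 <= i <= (M-1)/2; solves x^2 + x = c if Tr(c) = 0."""
    h = 0
    for _ in range((M + 1) // 2):
        h ^= c
        c = mul(c, c)
        c = mul(c, c)
    return h

def sqrt(c):
    """Square root as c^(2^(M-1)), i.e. M - 1 squarings."""
    for _ in range(M - 1):
        c = mul(c, c)
    return c

def on_curve(P):
    x, y = P
    x2 = mul(x, x)
    return mul(y, y) ^ mul(x, y) == mul(x2, x) ^ mul(A, x2) ^ B

def double(P):
    """Affine doubling via (3.39)-(3.41); requires x != 0."""
    x, y = P
    lam = x ^ mul(y, inv(x))
    u = mul(lam, lam) ^ lam ^ A
    return u, mul(x, x) ^ mul(u, lam ^ 1)

def halve(Q):
    """Point halving for affine input, returning the affine point P with 2P = Q."""
    u, v = Q
    lam = half_trace(u ^ A)            # a solution of lam^2 + lam = u + a
    t = v ^ mul(u, lam)
    if trace(t) == 0:
        x = sqrt(t ^ u)
    else:
        lam ^= 1
        x = sqrt(t)
    return x, mul(lam, x) ^ mul(x, x)  # y = lam*x + x^2

def points():
    """Yield one affine point for each x != 0 that occurs on the curve."""
    for x in range(1, 1 << M):
        rhs = x ^ A ^ mul(B, inv(mul(x, x)))   # with w = y/x: w^2 + w = x + a + b/x^2
        if trace(rhs) == 0:
            yield x, mul(x, half_trace(rhs))

R = next(points())
Q = double(R)                          # Q = 2R lies in the odd-order subgroup
P = halve(Q)
print(R, Q, P, double(P) == Q)
```

Since $\mathrm{Tr}(a) = 1$, the group order is twice an odd number, so every doubled point lies in the odd-order subgroup and can be halved. The halved point $P$ need not equal the starting point $R$, because $R$ may differ from $P$ by the point of order two.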
3.6.2 Performing point halving efficiently


Point halving requires a field multiplication and three main steps: (i) computing the
trace of $t$; (ii) solving the quadratic equation (3.42); and (iii) computing a square
root. In a normal basis, field elements are represented in terms of a basis of the form
$\{\beta, \beta^2, \ldots, \beta^{2^{m-1}}\}$. The trace of an element $c = \sum c_i\beta^{2^i} = (c_0, c_1, \ldots, c_{m-1})$ is given by
$\mathrm{Tr}(c) = \sum c_i$. The square root computation is a left rotation: $\sqrt{c} = (c_1, \ldots, c_{m-1}, c_0)$.
Squaring is a right rotation, and $x^2 + x = c$ can be solved bitwise. These operations
are expected to be inexpensive relative to field multiplication. However, field
multiplication in software for normal basis representations is very slow in comparison to
multiplication with a polynomial basis. Conversion between polynomial and normal
bases at each halving appears unlikely to give a competitive method, even if significant
storage is used. For these reasons, we restrict our discussion to computations in a
polynomial basis representation.


Computing the trace 


Let $c = \sum_{i=0}^{m-1} c_iz^i \in \mathbb{F}_{2^m}$, with $c_i \in \{0, 1\}$, represented as the vector $c = (c_{m-1}, \ldots, c_0)$.
A primitive method for computing $\mathrm{Tr}(c)$ uses the definition of trace, requiring $m - 1$
field squarings and $m - 1$ field additions. A much more efficient method makes use of
the property that the trace is linear:

\[ \mathrm{Tr}(c) = \mathrm{Tr}\Big(\sum_{i=0}^{m-1} c_iz^i\Big) = \sum_{i=0}^{m-1} c_i\,\mathrm{Tr}(z^i). \]

The values $\mathrm{Tr}(z^i)$ may be precomputed, allowing the trace of an element to be found
efficiently, especially if $\mathrm{Tr}(z^i) = 0$ for most $i$.


Example 3.82 (computing traces of elements in $\mathbb{F}_{2^{163}}$) Consider $\mathbb{F}_{2^{163}}$ with reduction
polynomial $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$. A routine calculation shows that $\mathrm{Tr}(z^i) = 1$
if and only if $i \in \{0, 157\}$. As examples, $\mathrm{Tr}(z^{160} + z^{46}) = 0$, $\mathrm{Tr}(z^{157} + z^{46}) = 1$, and
$\mathrm{Tr}(z^{157} + z^{46} + 1) = 0$.
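The linearity argument can be sketched in code as follows. The names `square`, `trace_by_definition`, `TRACE_MASK`, and `trace_fast` are hypothetical; field elements are stored as Python integers with bit $i$ holding the coefficient of $z^i$. The mask of indices with $\mathrm{Tr}(z^i) = 1$ is computed once from the definition, after which a trace is the parity of a few selected bits.

```python
M = 163
LOW = (1 << M) - 1

def square(c):
    """Square c in F_{2^163} with f(z) = z^163 + z^7 + z^6 + z^3 + 1."""
    r = int(format(c, "b"), 4)         # reading the bits in base 4 moves bit i to bit 2i
    while r >> M:
        hi, r = r >> M, r & LOW
        r ^= hi ^ (hi << 3) ^ (hi << 6) ^ (hi << 7)   # z^163 = z^7 + z^6 + z^3 + 1
    return r

def trace_by_definition(c):
    """Tr(c) = c + c^2 + ... + c^(2^(M-1)); the result is 0 or 1."""
    t = 0
    for _ in range(M):
        t ^= c
        c = square(c)
    return t

# Precomputation: bit i of TRACE_MASK is Tr(z^i).
TRACE_MASK = sum(1 << i for i in range(M) if trace_by_definition(1 << i))

def trace_fast(c):
    """Trace by linearity: parity of the coefficients c_i with Tr(z^i) = 1."""
    return bin(c & TRACE_MASK).count("1") & 1

print([i for i in range(M) if (TRACE_MASK >> i) & 1])
print(trace_fast((1 << 157) | (1 << 46)))
```

With only two nonzero entries in the mask, the fast trace reduces to adding two coefficient bits.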


Solving the quadratic equation 


The first step of point halving seeks a solution $x$ of a quadratic equation of the form
$x^2 + x = c$ over $\mathbb{F}_{2^m}$. The time performance of this step is crucial in obtaining an
efficient point halving.


Definition 3.83 Let $m$ be an odd integer. The half-trace function $H : \mathbb{F}_{2^m} \to \mathbb{F}_{2^m}$ is
defined by

\[ H(c) = \sum_{i=0}^{(m-1)/2} c^{2^{2i}}. \]

Lemma 3.84 (properties of the half-trace function) Let $m$ be an odd integer.
(i) $H(c + d) = H(c) + H(d)$ for all $c, d \in \mathbb{F}_{2^m}$.
(ii) $H(c)$ is a solution of the equation $x^2 + x = c + \mathrm{Tr}(c)$.
(iii) $H(c) = H(c^2) + c + \mathrm{Tr}(c)$ for all $c \in \mathbb{F}_{2^m}$.


Let $c = \sum_{i=0}^{m-1} c_iz^i \in \mathbb{F}_{2^m}$ with $\mathrm{Tr}(c) = 0$; in particular, $H(c)$ is a solution of
$x^2 + x = c$. A simple method for finding $H(c)$ directly from the definition requires $m - 1$
squarings and $(m - 1)/2$ additions. If storage for $\{H(z^i) : 0 \le i < m\}$ is available, then
Lemma 3.84(i) may be applied to obtain

\[ H(c) = H\Big(\sum_{i=0}^{m-1} c_iz^i\Big) = \sum_{i=0}^{m-1} c_iH(z^i). \]

However, this requires storage for $m$ field elements, and the associated method requires
an average of $m/2$ field additions.

Lemma 3.84 can be used to significantly reduce the storage required as well as the
time needed to solve the quadratic equation. The basic strategy is to write $H(c) =
H(c') + s$ where $c'$ has fewer nonzero coefficients than $c$. For even $i$, note that

\[ H(z^i) = H(z^{i/2}) + z^{i/2} + \mathrm{Tr}(z^i). \]

Algorithm 3.85 is based on this observation, eliminating storage of $H(z^i)$ for all even
$i$. Precomputation builds a table of $(m - 1)/2$ field elements $H(z^i)$ for odd $i$, and the
algorithm is expected to have approximately $m/4$ field additions at step 4. The terms
involving $\mathrm{Tr}(z^i)$ and $H(1)$ have been discarded, since it suffices to produce a solution
$s \in \{H(c), H(c) + 1\}$ of $x^2 + x = c$.


Algorithm 3.85 Solve $x^2 + x = c$ (basic version)


INPUT: $c = \sum_{i=0}^{m-1} c_iz^i \in \mathbb{F}_{2^m}$ where $m$ is odd and $\mathrm{Tr}(c) = 0$.
OUTPUT: A solution $s$ of $x^2 + x = c$.
1. Precompute $H(z^i)$ for odd $i$, $1 \le i \le m - 2$.
2. $s \leftarrow 0$.
3. For $i$ from $(m - 1)/2$ downto 1 do
   3.1 If $c_{2i} = 1$ then do: $c \leftarrow c + z^i$, $s \leftarrow s + z^i$.
4. $s \leftarrow s + \sum_{i=1}^{(m-1)/2} c_{2i-1}H(z^{2i-1})$.
5. Return($s$).
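A minimal sketch of Algorithm 3.85 follows, assuming the toy field $\mathbb{F}_{2^7}$ with $f(z) = z^7 + z + 1$ so that every input can be checked exhaustively. The function names and the table `H_ODD` are hypothetical, and the table is filled directly from the definition of the half-trace.

```python
M, F = 7, 0b10000011   # assumption: toy field F_{2^7}, f(z) = z^7 + z + 1

def mul(x, y):
    """Multiply field elements; bit i of an int is the coefficient of z^i."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= F
    return r

def trace(c):
    t = 0
    for _ in range(M):
        t ^= c
        c = mul(c, c)
    return t

def half_trace(c):
    """H(c) from Definition 3.83: sum of c^(2^(2i)) for 0 <= i <= (M-1)/2."""
    h = 0
    for _ in range((M + 1) // 2):
        h ^= c
        c = mul(c, c)
        c = mul(c, c)
    return h

# Step 1: table of H(z^i) for odd i, 1 <= i <= m - 2.
H_ODD = {i: half_trace(1 << i) for i in range(1, M - 1, 2)}

def solve_quadratic(c):
    """Return s with s^2 + s = c, assuming Tr(c) = 0."""
    s = 0
    for i in range((M - 1) // 2, 0, -1):   # step 3: replace z^(2i) by z^i
        if (c >> (2 * i)) & 1:
            c ^= 1 << i
            s ^= 1 << i
    for i, h in H_ODD.items():             # step 4: add H(z^i) for odd i with c_i = 1
        if (c >> i) & 1:
            s ^= h
    return s

c = next(c for c in range(2, 1 << M) if trace(c) == 0)
s = solve_quadratic(c)
print(c, s, mul(s, s) ^ s == c)
```

The loop in step 3 runs downward so that a coefficient moved from $z^{2i}$ to $z^i$ is itself folded later if $i$ is even.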


Further improvements are possible by use of Lemma 3.84 together with the reduction
polynomial $f(z)$. Let $i$ be odd, and define $j$ and $s$ by

\[ m \le 2^ji = m + s < 2m. \]

The basic idea is to apply Lemma 3.84(iii) $j$ times, obtaining

\[ H(z^i) = H(z^{2^ji}) + z^{2^{j-1}i} + \cdots + z^{4i} + z^{2i} + z^i + j\,\mathrm{Tr}(z^i). \tag{3.43} \]

Let $f(z) = z^m + r(z)$, where $r(z) = z^{b_\ell} + \cdots + z^{b_1} + 1$ and $0 < b_1 < \cdots < b_\ell < m$. Then

\[ H(z^{2^ji}) = H(z^sr(z)) = H(z^{s+b_\ell}) + H(z^{s+b_{\ell-1}}) + \cdots + H(z^{s+b_1}) + H(z^s). \]

Thus, storage for $H(z^i)$ may be exchanged for storage of $H(z^{s+e})$ for $e \in \{0, b_1, \ldots, b_\ell\}$
(some of which may be further reduced). The amount of storage reduction is limited
by dependencies among elements $H(z^i)$.

If $\deg r < m/2$, the strategy can be applied in an especially straightforward fashion
to eliminate some of the storage for $H(z^i)$ in Algorithm 3.85. For $m/2 < i < m - \deg r$,

\[
\begin{aligned}
H(z^i) &= H(z^{2i}) + z^i + \mathrm{Tr}(z^i) \\
       &= H(r(z)z^{2i-m}) + z^i + \mathrm{Tr}(z^i) \\
       &= H(z^{2i-m+b_\ell} + \cdots + z^{2i-m+b_1} + z^{2i-m}) + z^i + \mathrm{Tr}(z^i).
\end{aligned}
\]

Since $2i - m + \deg r < i$, the reduction may be applied to eliminate storage of $H(z^i)$ for
odd $i$, $m/2 < i < m - \deg r$. If $\deg r$ is small, Algorithm 3.86 requires approximately
$m/4$ elements of storage.


Algorithm 3.86 Solve $x^2 + x = c$


INPUT: $c = \sum_{i=0}^{m-1} c_iz^i \in \mathbb{F}_{2^m}$ where $m$ is odd and $\mathrm{Tr}(c) = 0$, and reduction polynomial
$f(z) = z^m + r(z)$.
OUTPUT: A solution $s$ of $x^2 + x = c$.
1. Precompute $H(z^i)$ for $i \in I_0 \cup I_1$, where $I_0$ and $I_1$ consist of the odd integers in
   $[1, (m - 1)/2]$ and $[m - \deg r, m - 2]$, respectively.
2. $s \leftarrow 0$.
3. For each odd $i \in ((m - 1)/2, m - \deg r)$, processed in decreasing order, do:
   3.1 If $c_i = 1$ then do: $c \leftarrow c + z^{2i-m+b_\ell} + \cdots + z^{2i-m}$, $s \leftarrow s + z^i$.
4. For $i$ from $(m - 1)/2$ downto 1 do:
   4.1 If $c_{2i} = 1$ then do: $c \leftarrow c + z^i$, $s \leftarrow s + z^i$.
5. $s \leftarrow s + \sum_{i \in I_0 \cup I_1} c_iH(z^i)$.
6. Return($s$).


The technique may also reduce the time required for solving the quadratic equation,
since the cost of reducing each $H(z^i)$ may be less than the cost of adding a precomputed
value of $H(z^i)$ to the accumulator. Elimination of the even terms (step 4) can be
implemented efficiently. Processing odd terms (as in step 3) is more involved, but will
be less expensive than a field addition if only a few words must be updated.

Example 3.87 (Algorithm 3.86 for the field $\mathbb{F}_{2^{163}}$) Consider $\mathbb{F}_{2^{163}}$ with reduction
polynomial $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$. Step 3 of Algorithm 3.86 begins with $i = 155$.
By Lemma 3.84,

\[
\begin{aligned}
H(z^{155}) &= H(z^{310}) + z^{155} + \mathrm{Tr}(z^{155}) \\
           &= H(z^{147}z^{163}) + z^{155} \\
           &= H(z^{147}(z^7 + z^6 + z^3 + 1)) + z^{155}.
\end{aligned}
\]

If $c_{155} = 1$, then $z^{154} + z^{153} + z^{150} + z^{147}$ is added to $c$, and $z^{155}$ is added to $s$. In this
fashion, storage for $H(z^i)$ is eliminated for $i \in \{83, 85, \ldots, 155\}$, the odd integers in
$((m - 1)/2, m - \deg r)$.

Algorithm 3.86 uses 44 field elements of precomputation. While this is roughly half
that required by the basic algorithm, it is not minimal. For example, storage for $H(z^{51})$
may be eliminated, since

\[
\begin{aligned}
H(z^{51}) &= H(z^{102}) + z^{51} + \mathrm{Tr}(z^{51}) \\
          &= H(z^{204}) + z^{102} + z^{51} + \mathrm{Tr}(z^{102}) + \mathrm{Tr}(z^{51}) \\
          &= H(z^{163}z^{41}) + z^{102} + z^{51} \\
          &= H(z^{48} + z^{47} + z^{44} + z^{41}) + z^{102} + z^{51}
\end{aligned}
\]

which corresponds to equation (3.43) with $j = 2$. The same technique eliminates storage
for $H(z^i)$, $i \in \{51, 49, \ldots, 41\}$. Similarly, if (3.43) is applied with $i = 21$ and $j = 3$,
then

\[ H(z^{21}) = H(z^{12} + z^{11} + z^8 + z^5) + z^{84} + z^{42} + z^{21}. \]

Note that the odd exponents 11 and 5 are less than 21, and hence storage for $H(z^{21})$
may be eliminated.

In summary, the use of (3.43) with $j \in \{1, 2, 3\}$ eliminates storage for odd values
of $i \in \{21, 41, \ldots, 51, 83, \ldots, 155\}$, and a corresponding algorithm for solving the
quadratic equation requires 37 elements of precomputation. Further reductions are
possible, but there are some complications since the formula for $H(z^i)$ involves $H(z^j)$ for
$j > i$. As an example,

\[ H(z^{23}) = H(z^{28} + z^{27} + z^{24} + z^{21}) + z^{92} + z^{46} + z^{23} \]

and storage for $H(z^{23})$ may be exchanged for storage on $H(z^{27})$. These strategies
reduce the precomputation to 30 field elements, significantly less than the 44 used in
Algorithm 3.86. In fact, use of

\[ z^n = z^{157+n} + z^{n+1} + z^{n-3} + z^{n-6} \]

together with the previous techniques reduces the storage to 21 field elements $H(z^i)$ for
$i \in \{157, 73, 69, 65, 61, 57, 53, 39, 37, 33, 29, 27, 17, 15, 13, 11, 9, 7, 5, 3, 1\}$. However,
this final reduction comes at a somewhat higher cost in required code compared with
the 30-element version.

Experimentally, the algorithm for solving the quadratic equation (with 21 or 30 ele- 
ments of precomputation) requires approximately 2/3 the time of a field multiplication. 
Special care should be given to branch misprediction factors (§5.1.4) as this algorithm 
performs many bit tests. 


Computing square roots in $\mathbb{F}_{2^m}$


The basic method for computing $\sqrt{c}$, where $c \in \mathbb{F}_{2^m}$, is based on the little theorem of
Fermat: $c^{2^m} = c$. Then $\sqrt{c}$ can be computed as $\sqrt{c} = c^{2^{m-1}}$, requiring $m - 1$ squarings.
A more efficient method is obtained from the observation that $\sqrt{c}$ can be expressed in
terms of the square root of the element $z$. Let $c = \sum_{i=0}^{m-1} c_iz^i \in \mathbb{F}_{2^m}$, $c_i \in \{0, 1\}$. Since
squaring is a linear operation in $\mathbb{F}_{2^m}$, the square root of $c$ can be written as

\[ \sqrt{c} = \Big(\sum_{i=0}^{m-1} c_iz^i\Big)^{2^{m-1}} = \sum_{i=0}^{m-1} c_i\big(z^{2^{m-1}}\big)^i. \]

Splitting $c$ into even and odd powers, we have

\[
\begin{aligned}
\sqrt{c} &= \sum_{i=0}^{(m-1)/2} c_{2i}\big(z^{2^{m-1}}\big)^{2i} + \sum_{i=0}^{(m-3)/2} c_{2i+1}\big(z^{2^{m-1}}\big)^{2i+1} \\
         &= \sum_{i=0}^{(m-1)/2} c_{2i}z^i + \sum_{i=0}^{(m-3)/2} c_{2i+1}z^{2^{m-1}}z^i \\
         &= \sum_{i\ \mathrm{even}} c_iz^{i/2} + \sqrt{z}\sum_{i\ \mathrm{odd}} c_iz^{(i-1)/2}.
\end{aligned}
\]

This reveals an efficient method for computing $\sqrt{c}$: extract the two half-length vectors
$c_{\mathrm{even}} = (c_{m-1}, \ldots, c_4, c_2, c_0)$ and $c_{\mathrm{odd}} = (c_{m-2}, \ldots, c_5, c_3, c_1)$ from $c$ (assuming $m$ is
odd), perform a field multiplication of $c_{\mathrm{odd}}$ of length $\lfloor m/2 \rfloor$ with the precomputed
value $\sqrt{z}$, and finally add this result with $c_{\mathrm{even}}$. The computation is expected to require
approximately half the time of a field multiplication.
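The even/odd splitting can be sketched as follows, again assuming the toy field $\mathbb{F}_{2^7}$ with $f(z) = z^7 + z + 1$ so that the result can be compared with the squaring-based method for every element. The function names are hypothetical.

```python
M, F = 7, 0b10000011   # assumption: toy field F_{2^7}, f(z) = z^7 + z + 1

def mul(x, y):
    """Multiply field elements; bit i of an int is the coefficient of z^i."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= F
    return r

def sqrt_by_squaring(c):
    """Basic method: sqrt(c) = c^(2^(M-1)), i.e. M - 1 squarings."""
    for _ in range(M - 1):
        c = mul(c, c)
    return c

SQRT_Z = sqrt_by_squaring(0b10)   # precomputed square root of z

def sqrt_split(c):
    """sqrt(c) = c_even + sqrt(z) * c_odd, with the half-length vectors packed densely."""
    even = odd = 0
    for i in range(M):
        if (c >> i) & 1:
            if i % 2 == 0:
                even |= 1 << (i // 2)      # z^i     -> z^(i/2)
            else:
                odd |= 1 << (i // 2)       # z^i     -> z^((i-1)/2)
    return even ^ mul(SQRT_Z, odd)

print(all(sqrt_split(c) == sqrt_by_squaring(c) for c in range(1 << M)))
```

The only field multiplication involves the half-length operand `odd`, which is the source of the expected saving.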

In the case that the reduction polynomial $f$ is a trinomial, the computation of $\sqrt{c}$
can be further accelerated by the observation that an efficient formula for $\sqrt{z}$ can be
derived directly from $f$. Let $f(z) = z^m + z^k + 1$ be an irreducible trinomial of degree
$m$, where $m > 2$ is prime.

Consider the case that $k$ is odd. Note that $1 \equiv z^m + z^k \pmod{f(z)}$. Then multiplying
by $z$ and taking the square root, we get

\[ \sqrt{z} \equiv z^{(m+1)/2} + z^{(k+1)/2} \pmod{f(z)}. \]

Thus, the product $\sqrt{z}\cdot c_{\mathrm{odd}}$ requires two shift-left operations and one modular
reduction.

Now suppose $k$ is even. Observe that $z^m \equiv z^k + 1 \pmod{f(z)}$. Then dividing by
$z^{m-1}$ and taking the square root, we get

\[ \sqrt{z} \equiv z^{-(m-1)/2}\big(z^{k/2} + 1\big) \pmod{f(z)}. \]

In order to compute $z^{-s}$ modulo $f(z)$, where $s = \frac{m-1}{2}$, one can use the congruences
$z^{-t} \equiv z^{k-t} + z^{m-t} \pmod{f(z)}$ for $1 \le t \le k$ for writing $z^{-s}$ as a sum of few positive
powers of $z$. Hence, the product $\sqrt{z}\cdot c_{\mathrm{odd}}$ can be performed with few shift-left
operations and one modular reduction.


Example 3.88 (square roots in $\mathbb{F}_{2^{409}}$) The reduction polynomial for the NIST recommended
finite field $\mathbb{F}_{2^{409}}$ is the trinomial $f(z) = z^{409} + z^{87} + 1$. Then, the new formula
for computing the square root of $c \in \mathbb{F}_{2^{409}}$ is

\[ \sqrt{c} = c_{\mathrm{even}} + z^{205}\cdot c_{\mathrm{odd}} + z^{44}\cdot c_{\mathrm{odd}} \bmod f(z). \]


Example 3.89 (square roots in $\mathbb{F}_{2^{233}}$) The reduction polynomial for the NIST recommended
finite field $\mathbb{F}_{2^{233}}$ is the trinomial $f(z) = z^{233} + z^{74} + 1$. Since $k = 74$ is even,
we have $\sqrt{z} = z^{-116}\cdot(z^{37} + 1) \bmod f(z)$. Note that $z^{-74} \equiv 1 + z^{159} \pmod{f(z)}$
and $z^{-42} \equiv z^{32} + z^{191} \pmod{f(z)}$. Then one gets that $z^{-116} \equiv z^{32} + z^{117} + z^{191}
\pmod{f(z)}$. Hence, the new method for computing the square root of $c \in \mathbb{F}_{2^{233}}$ is

\[ \sqrt{c} = c_{\mathrm{even}} + (z^{32} + z^{117} + z^{191})(z^{37} + 1)\cdot c_{\mathrm{odd}} \bmod f(z). \]


Compared to the standard method of computing square roots, the proposed technique 
eliminates the need of storage and replaces the required field multiplication by a faster 
operation. Experimentally, finding a root in Example 3.89 requires roughly 1/8 the 
time of a field multiplication. 
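The two trinomial formulas can be checked numerically. In the sketch below the helper names are hypothetical, and the multiplication is a plain shift-and-add routine rather than the optimized shift-left sequence the text has in mind; the point is only to confirm that the stated expressions for $\sqrt{z}$ square to $z$ and yield correct square roots.

```python
def reduce_trinomial(c, m, k):
    """Reduce the polynomial c modulo z^m + z^k + 1 (bit i = coefficient of z^i)."""
    while c >> m:
        hi, c = c >> m, c & ((1 << m) - 1)
        c ^= hi ^ (hi << k)                # z^m = z^k + 1
    return c

def mulmod(a, b, m, k):
    """Carry-less product of a and b, reduced modulo z^m + z^k + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
    return reduce_trinomial(r, m, k)

# Example 3.88: sqrt(z) = z^205 + z^44 in F_{2^409} (k = 87 is odd).
SQRT_Z_409 = (1 << 205) | (1 << 44)
# Example 3.89: sqrt(z) = (z^32 + z^117 + z^191)(z^37 + 1) in F_{2^233} (k = 74 is even).
SQRT_Z_233 = mulmod((1 << 32) | (1 << 117) | (1 << 191), (1 << 37) | 1, 233, 74)

def sqrt_trinomial(c, m, k, sqrt_z):
    """sqrt(c) = c_even + sqrt(z) * c_odd modulo z^m + z^k + 1."""
    even = odd = 0
    for i in range(m):
        if (c >> i) & 1:
            if i & 1:
                odd |= 1 << (i // 2)
            else:
                even |= 1 << (i // 2)
    return even ^ mulmod(sqrt_z, odd, m, k)

print(mulmod(SQRT_Z_409, SQRT_Z_409, 409, 87) == 0b10)
print(mulmod(SQRT_Z_233, SQRT_Z_233, 233, 74) == 0b10)
```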


3.6.3 Point multiplication 


Halve-and-add variants of the point multiplication methods discussed in §3.3 replace
most point doublings with halvings. Depending on the application, it may be necessary
to convert a given integer $k = (k_{t-1}, \ldots, k_0)_2$ for use with halving-based methods. If $k'$
is defined by

\[ k \equiv k'_{t-1}/2^{t-1} + \cdots + k'_2/2^2 + k'_1/2 + k'_0 \pmod{n} \]

then $kP = \sum_{i=0}^{t-1} k'_i/2^i\,P$; i.e., $(k'_{t-1}, \ldots, k'_0)$ is used by halving-based methods. This
can be generalized to width-$w$ NAF.

Lemma 3.90 Let $\sum_{i=0}^{t} k'_i2^i$ be the $w$-NAF representation of $2^{t-1}k \bmod n$. Then

\[ k \equiv \sum_{i=0}^{t-1} \frac{k'_{t-1-i}}{2^i} + 2k'_t \pmod{n}. \]

Proof: We have $2^{t-1}k \equiv \sum_{i=0}^{t} k'_i2^i \pmod{n}$. Since $n$ is prime, the congruence can be
divided by $2^{t-1}$ to obtain

\[ k \equiv \sum_{i=0}^{t} \frac{k'_i}{2^{t-1-i}} \equiv \sum_{i=0}^{t-1} \frac{k'_{t-1-i}}{2^i} + 2k'_t \pmod{n}. \]
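The conversion in Lemma 3.90 can be illustrated with integers only. The sketch below assumes an odd modulus (the Mersenne prime $2^{61} - 1$ is used purely as a stand-in for a group order $n$); the function names are hypothetical. It computes the width-$w$ NAF of $2^{t-1}k \bmod n$ and recombines the digits according to the lemma.

```python
def naf_w(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w                # signed residue in [-2^(w-1), 2^(w-1))
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def halving_digits(k, n, w):
    """Digits k'_0, ..., k'_t of NAF_w(2^(t-1) * k mod n), padded to length t + 1."""
    t = n.bit_length()
    d = naf_w(pow(2, t - 1, n) * k % n, w)
    return d + [0] * (t + 1 - len(d))

def recombine(digits, n):
    """Evaluate sum_{i<t} k'_{t-1-i} / 2^i + 2*k'_t modulo n."""
    t = n.bit_length()
    half = (n + 1) // 2                    # inverse of 2 modulo odd n
    total = 2 * digits[t]
    for i in range(t):
        total += digits[t - 1 - i] * pow(half, i, n)
    return total % n

n = 2**61 - 1                              # assumption: stand-in for the group order
k = 123456789012345678 % n
print(recombine(halving_digits(k, n, 4), n) == k)
```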


Algorithm 3.91 presents a right-to-left version of the halve-and-add method with the
input $2^{t-1}k \bmod n$ represented in $w$-NAF. Point halving occurs on the input $P$ rather
than on accumulators. Note that the coefficient $k'_t$ is handled separately in step 2 as it
corresponds to the special term $2k'_t$ in $k$. The expected running time is approximately

\[ (\text{step 4 cost}) + (t/(w + 1))A' + tH \tag{3.44} \]

where $H$ denotes a point halving and $A'$ is the cost of a point addition when one of
the inputs is in $\lambda$-representation. If projective coordinates are used for $Q_i$, then the
additions in steps 3.1 and 3.2 are mixed-coordinate. Step 4 may be performed by
conversion of $Q_i$ to affine (with cost $I + (5\cdot 2^{w-2} - 3)M$ if inverses are obtained by a
simultaneous method), and then the sum is obtained by interleaving with appropriate
signed-digit representations of the odd multipliers $i$. The cost of step 4 for $2 \le w \le 5$ is
approximately $w - 2$ point doublings and 0, 2, 6, or 16 point additions, respectively.³


Algorithm 3.91 Halve-and-add w-NAF (right-to-left) point multiplication
INPUT: Window width $w$, $\mathrm{NAF}_w(2^{t-1}k \bmod n) = \sum_{i=0}^{t} k'_i2^i$, $P \in G$.
OUTPUT: $kP$. (Note: $k = k'_0/2^{t-1} + \cdots + k'_{t-2}/2 + k'_{t-1} + 2k'_t \bmod n$.)
1. Set $Q_i \leftarrow \infty$ for $i \in I = \{1, 3, \ldots, 2^{w-1} - 1\}$.
2. If $k'_t = 1$ then $Q_1 = 2P$.
3. For $i$ from $t - 1$ downto 0 do:
   3.1 If $k'_i > 0$ then $Q_{k'_i} \leftarrow Q_{k'_i} + P$.
   3.2 If $k'_i < 0$ then $Q_{-k'_i} \leftarrow Q_{-k'_i} - P$.
   3.3 $P \leftarrow P/2$.
4. $Q \leftarrow \sum_{i \in I} iQ_i$.
5. Return($Q$).


³Knuth suggests calculating $Q_i \leftarrow Q_i + Q_{i+2}$ for $i$ from $2^{w-1} - 3$ to 1, and then the result is given by
$Q_1 + 2\sum_{i \in I\setminus\{1\}} Q_i$. The cost is comparable in the projective point case.

Consider the case $w = 2$. The expected running time of Algorithm 3.91 is then
approximately $(1/3)tA' + tH$. If affine coordinates are used, then a point halving costs
approximately $2M$, while a point addition costs $2M + V$ since the $\lambda$-representation of
$P$ must be converted to affine with one field multiplication. It follows that the field
operation count with affine coordinates is approximately $(8/3)tM + (1/3)tV$. However,
if $Q$ is stored in projective coordinates, then a point addition requires $9M$. The field
operation count of a mixed-coordinate Algorithm 3.91 with $w = 2$ is then approximately
$5tM + (2M + I)$.
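The two estimates for $w = 2$ can be evaluated directly. The sketch below (with the hypothetical name `halving_cost_w2`) assumes $H = 2M$ and $V = I + M$ as stated in this section, and expresses all costs in units of one field multiplication.

```python
def halving_cost_w2(t, inv_ratio):
    """Estimated field multiplications for right-to-left halve-and-add with w = 2.

    t is the bitlength of n and inv_ratio is I/M. Assumes H = 2M and V = I + M.
    """
    V = inv_ratio + 1
    # Affine: tH + (t/3)A' with A' = 2M + V, i.e. (8/3)tM + (1/3)tV.
    affine = 2 * t + (t / 3) * (2 + V)
    # Mixed coordinates: tH + (t/3)(9M) plus a final conversion 2M + I.
    projective = 2 * t + (t / 3) * 9 + 2 + inv_ratio
    return affine, projective

affine, projective = halving_cost_w2(163, 8)
print(round(affine), round(projective))
```

For $t = 163$ and $I/M = 8$ the rounded values agree with the NAF halving row of Table 3.11.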

Algorithm 3.92 is a left-to-right method. Point halving occurs on the accumulator
$Q$, whence projective coordinates cannot be used. The expected running time is
approximately

\[ \big(D + (2^{w-2} - 1)A\big) + \big(t/(w + 1)\,A' + tH\big). \tag{3.45} \]








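Algorithm 3.92 Halve-and-add w-NAF (left-to-right) point multiplication
INPUT: Window width $w$, $\mathrm{NAF}_w(2^{t-1}k \bmod n) = \sum_{i=0}^{t} k'_i2^i$, $P \in G$.
OUTPUT: $kP$. (Note: $k = k'_0/2^{t-1} + \cdots + k'_{t-2}/2 + k'_{t-1} + 2k'_t \bmod n$.)
1. Compute $P_i = iP$, for $i \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$.
2. $Q \leftarrow \infty$.
3. For $i$ from 0 to $t - 1$ do
   3.1 $Q \leftarrow Q/2$.
   3.2 If $k'_i > 0$ then $Q \leftarrow Q + P_{k'_i}$.
   3.3 If $k'_i < 0$ then $Q \leftarrow Q - P_{-k'_i}$.
4. If $k'_t = 1$ then $Q \leftarrow Q + 2P$.
5. Return($Q$).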
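The digit handling in Algorithms 3.91 and 3.92 can be illustrated without any curve arithmetic. The sketch below replaces the elliptic curve group by the additive group of integers modulo an odd $n$, which is purely an assumption for illustration: a "point" is an integer, the point at infinity is 0, and halving is multiplication by the inverse of 2. All function names are hypothetical.

```python
def naf_w(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        d = 0
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d
        digits.append(d)
        k >>= 1
    return digits

def _digits(k, n, w):
    t = n.bit_length()
    d = naf_w(pow(2, t - 1, n) * k % n, w)
    return t, d + [0] * (t + 1 - len(d))

def halve_and_add_rtl(k, P, n, w):
    """Right-to-left method: halve the input P, add it into accumulators Q_i."""
    t, d = _digits(k, n, w)
    half = (n + 1) // 2
    Q = {i: 0 for i in range(1, 1 << (w - 1), 2)}   # step 1
    if d[t] == 1:                                   # step 2
        Q[1] = 2 * P % n
    for i in range(t - 1, -1, -1):                  # step 3
        if d[i] > 0:
            Q[d[i]] = (Q[d[i]] + P) % n
        elif d[i] < 0:
            Q[-d[i]] = (Q[-d[i]] - P) % n
        P = P * half % n                            # P <- P/2
    return sum(i * Qi for i, Qi in Q.items()) % n   # step 4

def halve_and_add_ltr(k, P, n, w):
    """Left-to-right method: halve the accumulator Q, add precomputed odd multiples."""
    t, d = _digits(k, n, w)
    half = (n + 1) // 2
    Pi = {i: i * P % n for i in range(1, 1 << (w - 1), 2)}   # step 1
    Q = 0
    for i in range(t):                              # step 3
        Q = Q * half % n
        if d[i] > 0:
            Q = (Q + Pi[d[i]]) % n
        elif d[i] < 0:
            Q = (Q - Pi[-d[i]]) % n
    if d[t] == 1:                                   # step 4
        Q = (Q + 2 * P) % n
    return Q

n, P, k = 2**61 - 1, 1234567, 9876543210
print(halve_and_add_rtl(k, P, n, 4) == halve_and_add_ltr(k, P, n, 4) == k * P % n)
```

In a real implementation the accumulators and the halving would operate on curve points, with the coordinate choices discussed in the text; only the control flow carries over from this toy group.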


Analysis 


In comparison to methods based on doubling, point halving looks best when $I/M$ is
small and $kP$ is to be computed for $P$ not known in advance. In applications, the
operations $kP$ and $kP + lQ$ with $P$ known in advance are also of interest, and this
section provides comparative results. The concrete example used is the NIST random
curve over $\mathbb{F}_{2^{163}}$ (§A.2.2), although the general conclusions apply more widely.


Example 3.93 (double-and-add vs. halve-and-add) Table 3.11 provides an operation
count comparison between double-and-add and halve-and-add methods for the NIST
random curve over $\mathbb{F}_{2^{163}}$. For the field operations, the assumption is that $I/M = 8$ and
that a field division has cost $I + M$.

The basic NAF halving method is expected to outperform the w-NAF doubling
methods. However, the halving method has 46 field elements of precomputation. In
contrast, Algorithm 3.36 with $w = 4$ (which runs in approximately the same time as
with $w = 5$) requires only six field elements of extra storage.

                                   Storage                                  Field operations (H = 2M, I/M = 8)
Method                             (field elts)  Point operations           affine               projective
NAF, doubling (Algorithm 3.36)                   163D + 54A                 217(M + V) = 2173    1089M + I = 1097
NAF, halving (Algorithm 3.91)      46            163H + 54A'                435M + 54V = 924     817M + I = 825
5-NAF, doubling (Algorithm 3.36)                 [D + 7A] + 163D + 27A      198(M + V) = 1982    879M + 8V + I = 959
4-NAF, halving (Algorithm 3.91)                  [3D + 6A] + 163H + 30A'    -                    671M + 2I = 687
5-NAF, halving (Algorithm 3.92)                  [D + 7A] + 163H + 27A'     388M + 35V = 705     -

Table 3.11. Point and field operation counts for point multiplication for the NIST random curve
over $\mathbb{F}_{2^{163}}$. Halving uses 30 field elements of precomputation in the solve routine, and 16
elements for square root. $A' = A + M$, the cost of a point addition when one of the inputs is in
$\lambda$-representation. Field operation counts assume that a division $V$ costs $I + M$.


The left-to-right w-NAF halving method requires that the accumulator be in affine
coordinates, and point additions have cost $2M + V$ (since a conversion from
$\lambda$-representation is required). For sufficiently large $I/M$, the right-to-left algorithm
will be preferred; in the example, Algorithm 3.91 with $w = 2$ will outperform
Algorithm 3.92 at roughly $I/M = 11$.


For point multiplication $kP$ where $P$ is not known in advance, the example case in
Table 3.11 predicts that use of halving gives roughly 25% improvement over a similar
method based on doubling, when $I/M = 8$.

The comparison is unbalanced in terms of storage required, since halving was per- 
mitted 46 field elements of precomputation in the solve and square root routines. The 
amount of storage in square root can be reduced at tolerable cost to halving; significant 
storage (e.g., 21-30 elements) for the solve routine appears to be essential. It should 
be noted, however, that the storage for the solve and square root routines is per field. In 
addition to the routines specific to halving, most of the support for methods based on 
doubling will be required, giving some code expansion. 


Random curves vs. Koblitz curves The $\tau$-adic methods on Koblitz curves (§3.4)
share strategy with halving in the sense that point doubling is replaced by a
less-expensive operation. In the Koblitz curve case, the replacement is the Frobenius map
$\tau : (x, y) \mapsto (x^2, y^2)$, an inexpensive operation compared to field multiplication. Point
multiplication on Koblitz curves using $\tau$-adic methods will be faster than those based
on halving, with approximate cost for $kP$ given by

\[ \Big(2^{w-2} - 1 + \frac{t}{w + 1}\Big)A + t\cdot(\text{cost of } \tau) \]

when using a width-$w$ $\tau$-adic NAF in Algorithm 3.70. To compare with Table 3.11,
assume that mixed coordinates are used, $w = 5$, and that field squaring has approximate
cost $M/6$. In this case, the operation count is approximately $379M$, significantly less
than the $687M$ required by the halving method.


Known point vs. unknown point In the case that $P$ is known in advance (e.g.,
signature generation in ECDSA) and storage is available for precomputation, halving loses
some of its performance advantages. For our case, and for relatively modest amounts
of storage, the single-table comb method (Algorithm 3.44) is among the fastest and can
be used to obtain meaningful operation count comparisons. The operation counts for
$kP$ using methods based on doubling and halving are approximately

\[ \frac{t}{w}\Big(D + \frac{2^w - 1}{2^w}A\Big) \quad\text{and}\quad \frac{t}{w}\Big(H + \frac{2^w - 1}{2^w}A'\Big) \]

respectively. In contrast to the random point case, roughly half the operations are point
additions. Note that the method based on doubling may use mixed-coordinate
arithmetic (in which case $D = 4M$, $A = 8M$, and there is a final conversion to affine),
while the method based on halving must work in affine coordinates (with $H = 2M$ and
$A' = V + 2M$). If $V = I + M$, then values of $t$ and $w$ of practical interest give a
threshold $I/M$ between 7 and 8, above which the method based on doubling is expected to
be superior (e.g., for $w = 4$ and $t = 163$, the threshold is roughly 7.4).
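The threshold can be reproduced from the two comb estimates. The sketch below uses hypothetical function names and expresses costs in field multiplications; the final conversion to affine in the doubling-based method is taken to cost $I + 2M$, which is an assumption made here since the text does not state the conversion cost.

```python
def comb_costs(t, w, inv_ratio):
    """Estimated costs (in field multiplications) of the single-table comb method.

    Doubling: mixed coordinates with D = 4M, A = 8M, plus an assumed I + 2M conversion.
    Halving: affine coordinates with H = 2M and A' = V + 2M, where V = I + M.
    """
    frac = (2**w - 1) / 2**w
    doubling = (t / w) * (4 + frac * 8) + inv_ratio + 2
    halving = (t / w) * (2 + frac * (inv_ratio + 1 + 2))
    return doubling, halving

def threshold(t, w):
    """I/M at which both estimates agree (each is linear in I/M)."""
    d0, h0 = comb_costs(t, w, 0)
    d1, h1 = comb_costs(t, w, 1)
    return (d0 - h0) / ((h1 - h0) - (d1 - d0))

print(round(threshold(163, 4), 1))
```

Above the computed ratio the doubling-based estimate is the smaller of the two, in line with the statement in the text.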


Simultaneous multiple point multiplication In ECDSA signature verification, the
computationally expensive step is a calculation $kP + lQ$ where only $P$ is known in
advance. If interleaving (Algorithm 3.51) is used with widths $w_1$ and $w_2$, respectively,
then the expected operation count for the method based on doubling is approximately

\[ \big[D + (2^{w_2-2} - 1)A\big] + t\Big[D + \Big(\frac{1}{w_1 + 1} + \frac{1}{w_2 + 1}\Big)A\Big] \]

where the precomputation involving $P$ is not included. (The expected count for the
method using halving can be estimated by a similar formula; however, a more precise
estimate must distinguish the case where consecutive additions occur, since the cost is
$A' + V + M$ rather than $2A'$.)

For sufficiently large $I/M$, the method based on doubling will be superior; in
Example 3.93, this occurs at roughly $I/M = 11.7$. When $I/M$ is such that halving is
preferred, the difference is less pronounced than in the case of a random point
multiplication $kP$, due to the larger number of point additions relative to halvings. Note
that the interleaving method cannot be efficiently converted to a right-to-left algorithm
(where $w_1 = w_2 = 2$), since the halving or doubling operation would be required on
two points at each step.


3.7 Point multiplication costs 


Selection of point multiplication algorithms is complicated by platform characteristics,
coordinate selection, memory and other constraints, security considerations (§5.3),
and interoperability requirements. This section presents basic comparisons among
algorithms for the NIST-recommended curves P-192 (over the prime field $\mathbb{F}_{p_{192}}$ for
$p_{192} = 2^{192} - 2^{64} - 1$) and B-163 and K-163 (random and Koblitz curves over the
binary field $\mathbb{F}_{2^{163}} = \mathbb{F}_2[z]/(z^{163} + z^7 + z^6 + z^3 + 1)$). The general assumptions are that
inversion in prime fields is expensive relative to multiplication, a modest amount of
storage is available for precomputation, and costs for point arithmetic can be estimated
by considering only field multiplications, squarings, and inversions.

The execution times of elliptic curve cryptographic schemes are typically dominated
by point multiplications. Estimates for point multiplication costs are presented for three
cases: (i) $kP$ where precomputation must be on-line; (ii) $kP$ for $P$ known in advance
and precomputation may be off-line; and (iii) $kP + lQ$ where only the precomputation
for $P$ may be done off-line. The latter two cases are motivated by protocols such as
ECDSA, where signature generation requires a calculation $kP$ where $P$ is fixed, and
signature verification requires a calculation $kP + lQ$ where $P$ is fixed and $Q$ is not
known a priori.

Estimates are given in terms of curve operations (point additions A and point dou- 
bles D), and the corresponding field operations (multiplications M and inversions 
I). The operation counts are roughly what are obtained using the basic approxima- 
tions presented with the algorithms; however, the method here considers the coordinate 
representations used in precomputation and evaluation stages, and various minor opti- 
mizations. On the other hand, the various representations for the scalars are generally 
assumed to be of full length, overestimating some counts. Nevertheless, the estimation 
method is sufficiently accurate to permit meaningful comparisons. 


Estimates for P-192 


Table 3.12 presents rough estimates of costs in terms of elliptic curve operations and
field operations for point multiplication methods for P-192, under the assumption that
field inversion has the cost of roughly 80 field multiplications. The high cost of
inversion encourages the use of projective coordinates and techniques such as simultaneous
inversion. Most of the entries involving projective coordinates are not very sensitive to
the precise value of $I/M$, provided that it is not dramatically smaller.

For point multiplication $kP$ where precomputation must be done on-line, the cost
of point doubles limits the improvements of windowing methods over the basic NAF
method. The large inverse to multiplication ratio gives a slight edge to the use of
Chudnovsky over affine in precomputation for window NAF. Fixed-base methods are
significantly faster (even with only a few points of storage), where the precomputation
costs are excluded and the number of point doubles at the evaluation stage is greatly
reduced. The cost of processing $Q$ in the multiple point methods for $kP + lQ$ diminishes
the usefulness of techniques that reduce the number of point doubles for known-point
multiplication. On the other hand, the cost of $kP + lQ$ is only a little higher than the
cost for unknown-point multiplication.

                                                        Points   EC operations   Field operations
Method               Coordinates                   w    stored   A      D        M       I      Total^a

Unknown point (kP, on-line precomputation)
Binary               affine                                      95     191      977     286    23857
(Algorithm 3.27)     Jacobian-affine                             95     191      2420    1      2500
Binary NAF           affine                                      63     191      886     254    21206
(Algorithm 3.31)     Jacobian-affine                             63     191      2082    1      2162
Window NAF           Jacobian-affine               4    3        41     193      1840    4      2160
(Algorithm 3.36)     Jacobian-Chudnovsky           5    7        38     192      1936    1      2016

Fixed base (kP, off-line precomputation)
Interleave           Jacobian-affine                                             1203    1      1283
(Algorithm 3.51)
Windowing            Chudnovsky-affine &
(Algorithm 3.41)     Jacobian-Chudnovsky
Windowing NAF        Chudnovsky-affine &
(Algorithm 3.42)     Jacobian-Chudnovsky
Comb                 Jacobian-affine               5    30       37     38       675     1      755
(Algorithm 3.44)
Comb 2-table

Multiple point multiplication (kP + lQ)
Simultaneous         Jacobian-affine &
(Algorithm 3.48)     Jacobian-Chudnovsky
Simultaneous JSF
Interleave           Jacobian-affine &
(Algorithm 3.51)     Jacobian-Chudnovsky

^a Total cost in field multiplications assuming field inversions have cost $I = 80M$.
^b Simultaneous inversion used in precomputation. ^c $C + A \to C$. ^d $J + C \to J$. ^e $J + A \to J$.
^f Sliding window variant.


Table 3.12. Rough estimates of point multiplication costs for the NIST curve over $\mathbb{F}_{p_{192}}$ for prime
$p_{192} = 2^{192} - 2^{64} - 1$. The unknown point methods for $kP$ include the cost of precomputation,
while fixed base methods do not. Multiple point methods find $kP + lQ$ where precomputation
costs involving only $P$ are excluded. Field squarings are assumed to have cost $S = .85M$.


The entry for kP by interleaving when P is fixed is understood to mean that Algorithm 3.51
is used with inputs v = 2, P_1 = P, P_2 = 2^d P, and half-length scalars k^1 and k^2
defined by k = 2^d k^2 + k^1 where d = ⌈t/2⌉ (the superscripts on k are indices, not
exponents). Width-3 NAFs are found for each of k^1 and k^2. An alternative with essentially
the same cost uses a simultaneous method (Algorithm 3.48) modified to process a single
column of the joint sparse form (Algorithm 3.50) of k^1 and k^2 at each step. This modified
“simultaneous JSF” algorithm is referenced in Table 3.12 for multiple point multiplication.
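
As an illustration of the scalar splitting described above, the following Python sketch
splits a t-bit scalar k into halves with k = 2^d k^2 + k^1 and d = ⌈t/2⌉, and then finds a
width-3 NAF of each half. It is only a sketch: the function names and the sample scalar
are hypothetical, and width_w_naf uses the usual digit-by-digit construction of a
width-w NAF rather than reproducing any numbered algorithm verbatim.

```python
def split_scalar(k, t):
    """Split a t-bit scalar as k = 2^d * k2 + k1 with d = ceil(t/2)."""
    d = (t + 1) // 2
    k1 = k & ((1 << d) - 1)   # low d bits
    k2 = k >> d               # remaining high bits
    return d, k1, k2

def width_w_naf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k >= 1:
        if k & 1:
            u = k % (1 << w)          # k mod 2^w ...
            if u >= 1 << (w - 1):     # ... mapped to the signed residue
                u -= 1 << w
            k -= u                    # now k is divisible by 2^w
        else:
            u = 0
        digits.append(u)
        k >>= 1
    return digits

def evaluate(digits):
    """Recombine signed digits into the integer they represent."""
    return sum(u << i for i, u in enumerate(digits))

t = 192
# Arbitrary 192-bit scalar chosen for illustration only.
k = 0x9F3C_51A7_0B2D_E846_7C19_D05A_3B6F_2E81_4D70_A9C3_58E2_16BD
d, k1, k2 = split_scalar(k, t)
naf1, naf2 = width_w_naf(k1, 3), width_w_naf(k2, 3)
print(d, len(naf1), len(naf2))
```

With P_1 = P and P_2 = 2^d P precomputed, kP = k^1 P_1 + k^2 P_2, so the two digit strings
can be processed together with roughly d doublings instead of t.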
Estimates for B-163 and K-163


Table 3.13 presents rough estimates of costs in terms of elliptic curve operations and
field operations for point multiplication methods for NIST random and Koblitz curves
(B-163 and K-163) over the binary field F_{2^163} = F_2[z]/(z^163 + z^7 + z^6 + z^3 + 1). The
estimates for TNAF algorithms are for K-163, while the other estimates are for B-163.
The choices of algorithm and coordinate representation are sensitive to the ratio of
field inversion to multiplication times, since the ratio is typically much smaller than
for prime fields. Further, a small ratio encourages the development of a fast division
algorithm for affine point arithmetic.


Estimates are presented for the cases I/M = 5 and I/M = 8 under the assumptions
that field division V has approximate cost I + M (i.e., division is roughly the same
cost as inversion followed by multiplication), and that field squarings are inexpensive
relative to multiplication. The assumptions and cases considered are motivated by
reported results on common hardware and experimental evidence in §5.1.5. Note that if
V/M ≤ 7, then affine coordinates will be preferred over projective coordinates in point
addition, although projective coordinates are still preferred in point doubling unless
V/M ≤ 3.
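
The totals in Tables 3.12 and 3.13 follow from a simple cost model: a method that uses M
field multiplications and I field inversions is charged M plus I times the assumed
inversion-to-multiplication ratio. The Python sketch below (the function name is
hypothetical) applies that model to the multiplication and inversion counts listed in
the tables for the binary method.

```python
def total_field_mults(mults, inversions, inv_to_mult_ratio):
    """Total cost in field multiplications, charging each inversion as `ratio` multiplications."""
    return mults + inversions * inv_to_mult_ratio

# Affine binary method on B-163 (Table 3.13): 486 multiplications, 243 inversions.
b163_binary = {ratio: total_field_mults(486, 243, ratio) for ratio in (5, 8)}

# Binary method on P-192 (Table 3.12), with I = 80M.
p192_affine = total_field_mults(977, 286, 80)     # affine coordinates
p192_jacobian = total_field_mults(2420, 1, 80)    # Jacobian-affine, one final inversion

print(b163_binary)      # {5: 1701, 8: 2430}
print(p192_affine)      # 23857
print(p192_jacobian)    # 2500
```

The P-192 figures show why projective coordinates dominate when inversions are expensive:
the affine method does fewer multiplications but pays for 286 inversions.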


As discussed for P-192, the cost of the point doubling limits the improvements of
windowing methods over the basic NAF method for B-163. However, the case is different
for Koblitz curves, where doublings have been replaced by inexpensive field
squarings. The squarings are not completely free, however, and the estimations for the
TNAF algorithms include field squarings that result from applications of the Frobenius
map τ under the assumption that a squaring has approximate cost S ≈ M/7.


Methods based on point halving (§3.6) have been included in the unknown-point
case, with the assumption that a halving has cost approximately 2M. The predicted
times are significantly better than those for the doubling-based methods on B-163, but
significantly slower than times for τ-adic methods on the special Koblitz curves. Note,
however, that the storage listed for halving-based methods ignores the (fixed) field
elements used in the solve and square root routines. Similarly, it should be noted that
the TNAF routines require support for the calculation of τ-adic NAFs.


Fixed-base methods are significantly faster (even with only a few points of storage),
where the precomputation costs are excluded and the number of point doubles (for
B-163) at the evaluation stage is greatly reduced. As with P-192, the cost of processing
Q in the multiple point methods for kP + lQ in B-163 diminishes the usefulness of
techniques that reduce the number of point doubles for known-point multiplication. The
case differs for the Koblitz curve, since field squarings replace most point doublings.


The discussion for P-192 clarifies the meaning of the entry for kP by interleaving
when P is fixed. The JSF method noted in the entries for kP + lQ has essentially the
same cost and could have been used. The entry for interleaving with TNAFs is obtained
by adapting the interleaving algorithm to process TNAFs.
                             |             |     | Points | EC operations      | Field operations^a
Method                       | Coordinates | w   | stored |  A      |  D       |  M  |  I  | I/M=5 | I/M=8

Unknown point (kP, on-line precomputation)
Binary                       | affine      |     |        | 81      | 162      | 486 | 243 | 1701  | 2430
Binary NAF                   | affine      |     |        | 54      | 162      | 432 | 216 | 1512  | 2160
Window NAF                   | affine      | 4   | 3      | 35      | 163      | 396 | 198 | 1386  | 1980
Montgomery (Algorithm 3.40)  | affine      |     |        | 162     | 162      | 328 | 325 | 1953  | 2928
                             | projective  |     |        |         | 162^d    | 982 | 1   | 987   | 990
Halving w-NAF                | affine      | 5   | 7      | 7+27^e  | 1+163^f  | 423 | 35  | 598   | 705
TNAF                         | affine      |     |        | 54      | 0^g      | 154 | 54  | 424   | 586
Window TNAF                  | affine      | 5   | 7      | 34      | 0^g      | 114 | 34  | 284   | 386

Fixed base (kP, off-line precomputation)
Interleave                   | affine      | 3,3 | 3      | 41      | 81       | 244 | 122 | 854   | 1220
Windowing                    | affine      | 5   | 32     | 61      | 0        | 122 | 61  | 427   | 610
Windowing NAF                | affine      | 5   | 32     | 52      | 0        | 104 | 52  | 364   | 520
Comb                         | affine      | 5   | 30     | 31      | 32       | 126 | 63  | 441   | 630
Window TNAF                  | affine      | 6   | 15     | 23      | 0^g      | 92  | 23  | 207   | 276

Multiple point multiplication (kP + lQ)
Simultaneous JSF             | affine      |     | 2      | 83      | 162      | 490 | 245 | 1715  | 2450
Simultaneous                 | affine      | 2   | 10     | 78      | 163      | 482 | 241 | 1687  | 2410
Interleave                   | affine      | 6,4 | 18     | 60      | 163      | 448 | 224 | 1568  | 2240
Interleave TNAF              | affine      | 6,5 | 22     | 59      | 0^g      | 164 | 59  | 459   | 636

^a Right columns give costs in terms of field multiplications for I/M = 5 and I/M = 8, resp.
^b affine. ^c Addition via (3.23). ^d x-coordinate only. ^e Cost A + M. ^f Halvings; estimated cost 2M.
^g Field ops include applications of τ with S = M/7. ^h P + Q and P − Q. ^i Sliding window variant.

Table 3.13. Rough estimates of point multiplication costs for the NIST curves over
F_{2^163} = F_2[z]/(z^163 + z^7 + z^6 + z^3 + 1). The unknown point methods for kP include the cost of
precomputation, while fixed base methods do not. Multiple point methods find kP + lQ where
precomputation costs involving only P are excluded. Precomputation is in affine coordinates.
NIST curve   Method   Field mult M   Pentium III (800 MHz): normalized M, μs

Unknown point (kP, on-line precomputation)
P-192   5-NAF (Algorithm 3.36, w = 5)
B-163   4-NAF (Algorithm 3.36, w = 4)
B-163   Halving (Algorithm 3.91, w = 4)
K-163   5-TNAF (Algorithm 3.70, w = 5)

Fixed base (kP, off-line precomputation)
P-192   Comb 2-table (Algorithm 3.45, w = 4)
B-163   Comb (Algorithm 3.44, w = 5)
K-163   6-TNAF (Algorithm 3.70, w = 6)

Multiple point multiplication (kP + lQ)
P-192   Interleave (Algorithm 3.51, w = 6,5)   2306
B-163   Interleave (Algorithm 3.51, w = 6,4)   1154
K-163   Interleave TNAF (Alg. 3.51 & 3.69, w = 6,5)   565

Table 3.14. Point multiplication timings on an 800 MHz Intel Pentium III using general-purpose
registers. M is the estimated number of field multiplications under the assumption that I/M = 80
and I/M = 8 in the prime and binary fields, resp. The normalization gives equivalent P-192 field
multiplications for this implementation.


Summary 


The summary multiplication counts in Tables 3.12 and 3.13 are not directly compa- 
rable, since the cost of field multiplication can differ dramatically between prime and 
binary fields on a given platform and between implementations. Table 3.14 gives field 
multiplication counts and actual execution times for a specific implementation on an 
800 MHz Intel Pentium III. The ratio of binary to prime field multiplication times in 
this particular case is approximately 3.1 (see §5.1.5), and multiplication counts are 
normalized in terms of P-192 field multiplications. 


As a rough comparison, the times show that unknown-point multiplications were
significantly faster in the Koblitz (binary) case than for the random binary or prime
curves, due to the inexpensive field squarings that have replaced most point doubles.
In the known point case, precomputation can reduce the number of point doubles, and
the faster prime field multiplication gives P-192 the edge. For kP + lQ where only the
precomputation for kP may be off-line, the times for K-163 and P-192 are comparable,
and significantly faster than the corresponding time given for B-163.


The execution times for methods on the Koblitz curve are longer than predicted, in
part because the cost of finding τ-adic NAFs is not represented in the estimates (but is
included in the execution times). Algorithms 3.63 and 3.65 used in finding τ-adic NAFs
were implemented with the “big number” routines from OpenSSL (see Appendix C).
Note also that limited improvements in the known-point case for the Koblitz curve may
be obtained via interleaving (using no more precomputation storage than granted to the
method for P-192).

There are several limitations of the comparisons presented here. Only general-purpose
registers were used in the implementation. Workstations commonly have
special-purpose registers that can be employed to speed field arithmetic. In particular,
the Pentium III has floating-point registers which can accelerate prime field arithmetic
(see §5.1.2), and single-instruction multiple-data (SIMD) registers that are easily
harnessed for binary field arithmetic (see §5.1.3). Although all Pentium family processors
have a 32×32 integer multiplier giving a 64-bit result, multiplication with
general-purpose registers on P6 family processors such as the Pentium III is faster than on
earlier Pentium or newer Pentium 4 processors. The times for P-192 may be less
competitive compared with Koblitz curve times on platforms where hardware integer
multiplication is weaker or operates on fewer bits. For the most part, we have not
distinguished between storage for data-dependent items and storage for items that are
fixed for a given field or curve. The case where a large amount of storage is available
for precomputation in known-point methods is not addressed.


3.8 Notes and further references 


§3.1 

A brief introduction to elliptic curves can be found in Chapter 6 of Koblitz’s book [254]. 
Intermediate-level textbooks that provide proofs of many of the basic results used in 
elliptic curve cryptography include Charlap and Robbins [92, 93], Enge [132],
Silverman and Tate [433], and Washington [474]. The standard advanced-level references on
the theory of elliptic curves are the two books by Silverman [429, 430].


Theorem 3.8 is due to Waterhouse [475]. Example 3.17 is from Wittmann [484]. 


§3.2 

Chudnovsky and Chudnovsky [96] studied four basic models of elliptic curves
including: (i) the Weierstrass model y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 used
throughout this book; (ii) the Jacobi model y^2 = x^4 + ax^2 + b; (iii) the Jacobi form
which represents the elliptic curve as the intersection of two quadrics x^2 + y^2 = 1 and
k^2 x^2 + z^2 = 1; and (iv) the Hessian form x^3 + y^3 + z^3 = Dxyz. Liardet and Smart [291]
observed that the rules for adding and doubling points in the Jacobi form are the same, 
thereby potentially increasing resistance to power analysis attacks. Joye and Quisquater 
[231] showed that this property also holds for the Hessian form, and concluded that the 
addition formulas for the Hessian form require fewer field operations than the addi- 
tion formulas for the Jacobi form (12 multiplications versus 16 multiplications). Smart 
[442] observed that the symmetry in the group law on elliptic curves in Hessian form 
can be exploited to parallelize (to three processors) the addition and doubling of points. 
Note that since the group of F_q-rational points on an elliptic curve in Hessian form
defined over F_q must contain a point of order 3, the Hessian form cannot be used for
the elliptic curves standardized by NIST. Elliptic curves in Hessian form were studied
extensively by Frium [151].


Chudnovsky coordinates were proposed by Chudnovsky and Chudnovsky [96]. The 
different combinations for mixed coordinate systems were compared by Cohen, Miyaji 
and Ono [100]. Note that their modified Jacobian coordinates do not yield any speedups
over (ordinary) Jacobian coordinates in point addition and doubling for elliptic curves
y^2 = x^3 + ax + b with a = −3; however, the strategy is useful in accelerating repeated
doublings in Algorithm 3.23. Lim and Hwang [293] choose projective coordinates
corresponding to (X/Z^2, Y/2Z^3); the division by 2 is eliminated, but point addition then
requires two more field additions.


LD coordinates were proposed by López and Dahab [300]. The formulas reflect an
improvement due to Lim and Hwang [294] and Al-Daoud, Mahmod, Rushdan, and 
Kilicman [10] resulting in one fewer multiplication (and one more squaring) in mixed- 
coordinate point addition. 


If field multiplication is via a method similar to Algorithm 2.36 with a data-dependent 
precomputation phase, then King [246] suggests organizing the point arithmetic to 
reduce the number of such precomputations (i.e., a table of precomputation may be 
used more than once). Depending on memory constraints, a single preserved table of 
precomputation is used, or multiple and possibly larger tables may be considered. 


Algorithm 3.23 for repeated doubling is an example of an improvement possible when
combinations of point operations are performed. An improvement of this type is
suggested by King [246] for the point addition and doubling formulas given by López and
Dahab [300]. A field multiplication can be traded for two squarings in the calculation
of 2(P + Q), since the value X_3 Z_3 required in the addition may be used in the
subsequent doubling. The proposal by Eisenträger, Lauter, and Montgomery [129] is similar
in the sense that a field multiplication is eliminated in the calculation of 2P + Q in
affine coordinates (by omitting the calculation of the y-coordinate of the intermediate
value P + Q in 2P + Q = (P + Q) + P).


§3.3 

The right-to-left binary method is described in Knuth [249], along with the
generalization to an m-ary method. Cohen [99] discusses right-to-left and left-to-right
algorithms with base 2^k. Gordon [179] provides a useful survey of exponentiation
methods. Menezes, van Oorschot, and Vanstone [319] cover exponentiation algorithms
of practical interest in more generality than presented here.


The density result in Theorem 3.29 is due to Morain and Olivos [333]. The window 
NAF method (Algorithms 3.36 and 3.35) is from Solinas [446], who remarks that 
“More elaborate window methods exist (see [179]), but they can require a great deal of 
initial calculation and seldom do much better than the technique presented here.” 
Möller [329] presents a fractional window technique that generalizes the sliding
window and window NAF approaches. The method has more flexibility in the amount of
precomputation, of particular interest when memory is constrained (see Note 3.39). For
window width w > 2 and given odd parameter v ≤ 2^{w−1} − 3, the fractional window
representation has average density 1/(w + 1 + (v + 1)/2^{w−1}); the method is fractional in
the sense that the effective window size has increased by (v + 1)/2^{w−1} compared with
the width-w NAF.


Algorithm 3.40 is due to López and Dahab [299], and is based on an idea of
Montgomery [331]. Okeya and Sakurai [359] extended this work to elliptic curves over finite
fields of characteristic greater than three.








The fixed-base windowing method (Algorithm 3.41) is due to Brickell, Gordon, Mc- 
Curley, and Wilson [72]. Gordon [179] cites the papers of de Rooij [109] and Lim 
and Lee [295] for vector addition chain methods that address the “observation that the 
BGMW method tends to use too much memory.” Special cases of the Lim-Lee method 
[295] appear in Algorithms 3.44 and 3.45; the general method is described in Note 
3.47. 


The use of simultaneous addition in Note 3.46 for Lim-Lee methods is described
by Lim and Hwang [294]. An enhancement for combing parameter v > 2 (see Note
3.47) is given which reduces the number of inversions from v − 1 in a straightforward
generalization to ⌈log_2 v⌉ (with ⌊v/2⌋e elements of temporary storage).


“Shamir’s trick” (Algorithm 3.48) for simultaneous point multiplication is attributed 
by ElGamal [131] to Shamir. The improvement with use of a sliding window is due 
to Yen, Laih, and Lenstra [487]. The joint sparse form is from Solinas [447]. Proos 
[383] generalizes the joint sparse form to any number of integers. A related “zero col- 
umn combing” method is also presented, generalizing the Lim-Lee method with signed 
binary representations to increase the number of zero columns in the exponent array. 
The improvement (for similar amounts of storage) depends on the relative costs of 
point addition and doubling and the amount of storage for precomputation; if additions 
have the same cost as doubles, then the example with 160-bit k and 32 points or less
of precomputation shows approximately 10% decrease in point operations (excluding
precomputation) in calculating kP.


Interleaving (Algorithm 3.51) is due to Gallant, Lambert, and Vanstone [160] and 
Möller [326]. Möller [329] notes that the interleaving approach for kP where k is
split and then w-NAFs are found for the fragments can “waste” part of each w-NAF. A 
window NAF splitting method is proposed, of particular interest when w is large. The 
basic idea is to calculate the w-NAF of k first, and then split. 


§3.4 

Koblitz curves are so named because they were first proposed for cryptographic use 
by Koblitz [253]. Koblitz explained how a τ-adic representation of an integer k can be
used to eliminate the point doubling operations when computing kP for a point P on a
Koblitz curve. Meier and Staffelbach [312] showed how a short τ-adic representation
of k can be obtained by first reducing k modulo τ^m − 1 in Z[τ]. TNAFs and
width-w TNAFs were introduced by Solinas [444]. The algorithms were further developed
and analyzed in the extensive article by Solinas [446]. Park, Oh, Lee, Lim and Sung
[370] presented an alternate method for obtaining short τ-adic representations. Their
method reduces the length of the τ-adic representation by about log_2 h, and thus offers
a significant improvement only if the cofactor h is large.


Some of the techniques for fast point multiplication on Koblitz curves were extended to
elliptic curves defined over small binary fields (e.g., F_{2^2}, F_{2^3}, F_{2^4} and F_{2^5}) by Müller
[334], and to elliptic curves defined over small extension fields of odd characteristic
by Koblitz [256] and Smart [439]. Günther, Lange and Stein [185] proposed
generalizations for point multiplication in the Jacobian of hyperelliptic curves of genus 2,
focusing on the curves y^2 + xy = x^5 + 1 and y^2 + xy = x^5 + x^2 + 1 defined over F_2.
Their methods were extended by Choie and Lee [94] to hyperelliptic curves of genus
2, 3 and 4 defined over finite fields of any characteristic.





§3.5 

The method for exploiting efficiently computable endomorphisms to accelerate point 
multiplication on elliptic curves is due to Gallant, Lambert and Vanstone [160], who 
also presented Algorithm 3.74 for computing a balanced length-two representation of 
a multiplier. The P-160 curve in Example 3.73 is from the wireless TLS specification 
[360]. Example 3.76 is due to Solinas [447]. 


Sica, Ciet and Quisquater [428] proved that the vector v2 = (a2, b2) in Algorithm 3.74 
has small norm. Park, Jeong, Kim and Lim [368] presented an alternate method for 
computing balanced length-two representations and proved that their method always 
works. Their experiments showed that the performances of this alternate decomposition 
method and of Algorithm 3.74 are the same in practice. Another method was proposed 
by Kim and Lim [242]. The Gallant-Lambert-Vanstone method was generalized to
balanced length-m multipliers by Müller [335] and shown to be effective for speeding up
point multiplication on certain elliptic curves defined over optimal extension fields.
Generalizations to hyperelliptic curves having efficiently computable endomorphisms 
were proposed by Park, Jeong and Lim [369]. 


Ciet, Lange, Sica, and Quisquater [98] extend the technique of τ-adic expansions on
Koblitz curves to curves over prime fields having an endomorphism φ with norm
exceeding 1. In comparison with the Gallant-Lambert-Vanstone method, approximately
(log_2 n)/2 point doubles in the calculation of kP are replaced by twice as many
applications of φ. A generalization of the joint sparse form (§3.3.3) to a φ-JSF is given for
endomorphism φ having characteristic polynomial x^2 + x + 2.
§3.6

Point halving was proposed independently by Knudsen [247] and Schroeppel [413]. 
Additional comparisons with methods based on doubling were performed by Fong, 
Hankerson, López and Menezes [144].


The performance advantage of halving methods is clearest in the case of point
multiplication kP where P is not known in advance, and smaller inversion to multiplication
ratios generally favour halving. Knudsen’s analysis [247] gives halving methods a 39%
advantage for the unknown point case, under the assumption that I/M ≈ 3. Fong,
Hankerson, López and Menezes [144] suggest that this ratio is too optimistic on common
SPARC and Pentium platforms, where the fastest times give I/M > 8. The larger ratio
reduces the advantage to approximately 25% in the unknown-point case under a similar
analysis; if P is known in advance and storage for a modest amount of precomputation
is available, then methods based on halving are inferior. For kP + lQ where only P
is known in advance, the differences between methods based on halving and methods
based on doubling are smaller, with halving methods faster for ratios I/M commonly
reported.


Algorithm 3.91 partially addresses the challenge presented in Knudsen [247] to
derive “an efficient halving algorithm for projective coordinates.” While the algorithm
does not provide halving on a projective point, it does illustrate an efficient windowing
method with halving and projective coordinates, especially applicable in the case of
larger I/M. Footnote 3 concerning the calculation of Q is from Knuth [249, Exercise
4.6.3-9]; see also Möller [326, 329].


§3.7 

Details of the implementation used for Table 3.14 appear in §5.1.5. In short, only 
general-purpose registers were used, prime field arithmetic is largely in assembly, and 
binary field arithmetic is entirely in C except for a one-line fragment used in polynomial 
degree calculations. The Intel compiler version 6 along with the Netwide Assembler
(NASM) were used on an Intel Pentium III running the Linux 2.2 operating system.


The 32-bit Intel Pentium III is roughly categorized as workstation-class, along with 
other popular processors such as the DEC Alpha (64-bit) and Sun SPARC (32-bit and 
64-bit) family. Lim and Hwang [293, 294] give extensive field and curve timings for 
the Intel Pentium II and DEC Alpha, especially for OEFs. Smart [440] provides
comparative timings on a Sun UltraSPARC IIi and an Intel Pentium Pro for curves over
prime, binary, and optimal extension fields. The NIST curves are the focus in
Hankerson, López, and Menezes [189] and Brown, Hankerson, López, and Menezes [77], with
field and curve timings on an Intel Pentium II. De Win, Mister, Preneel, and Wiener
[111] compare ECDSA to DSA and RSA signature algorithms, with timings on an Intel 
Pentium Pro. Weimerskirch, Stebila, and Chang Shantz [478] discuss implementations 
for binary fields that handle arbitrary field sizes and reduction polynomials; timings are 
given on a Pentium III and for 32- and 64-bit code on a Sun UltraSPARC II. 
Special-purpose hardware commonly available on workstations can dramatically speed
operations. Bernstein [43] gives timings for point multiplication on the NIST curve
over F_p for p = 2^224 − 2^96 + 1 using floating-point hardware on AMD, DEC, Intel, and
Sun processors at http://cr.yp.to/nistp224/timings.html. §5.1 provides an overview of
the use of floating-point and SIMD hardware.





CHAPTER 4 


Cryptographic Protocols 


This chapter describes some elliptic curve-based signature, public-key encryption, and 
key establishment schemes. §4.1 surveys the state-of-the-art in algorithms for solving 
the elliptic curve discrete logarithm problem, whose intractability is necessary for the 
security of all elliptic curve cryptographic schemes. Also discussed briefly in §4.1 are 
the elliptic curve analogues of the Diffie-Hellman and decision Diffie-Hellman prob- 
lems whose hardness is assumed in security proofs for some protocols. §4.2 and §4.3 
consider the generation and validation of domain parameters and key pairs for use in 
elliptic curve protocols. The ECDSA and EC-KCDSA signature schemes, the ECIES 
and PSEC public-key encryption schemes, and the STS and ECMQV key establish- 
ment schemes are presented in §4.4, §4.5, and §4.6, respectively. Extensive chapter 
notes and references are provided in §4.7. 


4.1 The elliptic curve discrete logarithm problem 


The hardness of the elliptic curve discrete logarithm problem is essential for the 
security of all elliptic curve cryptographic schemes. 


Definition 4.1 The elliptic curve discrete logarithm problem (ECDLP) is: given an
elliptic curve E defined over a finite field F_q, a point P ∈ E(F_q) of order n, and a point
Q ∈ ⟨P⟩, find the integer l ∈ [0, n − 1] such that Q = lP. The integer l is called the
discrete logarithm of Q to the base P, denoted l = log_P Q.


The elliptic curve parameters for cryptographic schemes should be carefully chosen
in order to resist all known attacks on the ECDLP. The most naive algorithm for
solving the ECDLP is exhaustive search whereby one computes the sequence of points
P, 2P, 3P, 4P, ... until Q is encountered. The running time is approximately n steps
in the worst case and n/2 steps on average. Therefore, exhaustive search can be
circumvented by selecting elliptic curve parameters with n sufficiently large to represent
an infeasible amount of computation (e.g., n > 2^80). The best general-purpose attack
known on the ECDLP is the combination of the Pohlig-Hellman algorithm and
Pollard’s rho algorithm, which has a fully-exponential running time of O(√p) where p is
the largest prime divisor of n. To resist this attack, the elliptic curve parameters should
be chosen so that n is divisible by a prime number p sufficiently large so that √p steps
is an infeasible amount of computation (e.g., p > 2^160). If, in addition, the elliptic curve
parameters are carefully chosen to defeat all other known attacks (see §4.1.4), then the
ECDLP is believed to be infeasible given the state of today’s computer technology.

It should be noted that there is no mathematical proof that the ECDLP is intractable.
That is, no one has proven that there does not exist an efficient algorithm for solving
the ECDLP. Indeed, such a proof would be extremely surprising. For example, the
nonexistence of a polynomial-time algorithm for the ECDLP would imply that P ≠ NP thus
settling one of the fundamental outstanding open questions in computer science.¹
Furthermore, there is no theoretical evidence that the ECDLP is intractable. For example,
the ECDLP is not known to be NP-hard,² and it is not likely to be proven to be NP-hard
since the decision version of the ECDLP is known to be in both NP and co-NP.³

Nonetheless, some evidence for the intractability of the ECDLP has been gath- 
ered over the years. First, the problem has been extensively studied by researchers 
since elliptic curve cryptography was first proposed in 1985 and no general-purpose 
subexponential-time algorithm has been discovered. Second, Shoup has proven a lower 
bound of √n for the discrete logarithm problem in generic groups of prime order n,
where the group elements are random bit strings and one only has access to the group 
operation through a hypothetical oracle. While Shoup’s result does not imply that the 
ECDLP is indeed hard (since the elements of an elliptic curve group have a mean- 
ingful and non-random representation), it arguably offers some hope that the discrete 
logarithm problem is hard in some groups. 

The Pohlig-Hellman and Pollard’s rho algorithms for the ECDLP are presented in 
§4.1.1 and §4.1.2, respectively. In §4.1.3, we survey the attempts at devising general- 
purpose subexponential-time attacks for the ECDLP. Isomorphism attacks attempt to 
reduce the ECDLP to the DLP in an isomorphic group for which subexponential-time 


¹ P is the complexity class of decision (YES/NO) problems with polynomial-time algorithms. NP is the
complexity class of decision problems whose YES answers can be verified in polynomial-time if one is
presented with an appropriate proof. While it can readily be seen that P ⊆ NP, it is not known whether
P = NP.

² A problem is NP-hard if all NP problems polynomial-time reduce to it. NP-hardness of a problem is
considered evidence for its intractability since the existence of a polynomial-time algorithm for the problem
would imply that P = NP.

³ co-NP is the complexity class of decision problems whose NO answers can be verified in polynomial-time
if one is presented with an appropriate proof. It is not known whether NP = co-NP. However, the
existence of an NP-hard decision problem that is in both NP and co-NP would imply that NP = co-NP.
(or faster) algorithms are known. These attacks include the Weil and Tate pairing
attacks, attacks on prime-field-anomalous curves, and the Weil descent methodology.
While the mathematics behind these isomorphism attacks is quite sophisticated, the 
cryptographic implications of the attacks can be easily explained and there are simple 
countermeasures known for verifying that a given elliptic curve is immune to them. For 
these reasons, we have chosen to restrict the presentation of the isomorphism attacks 
in §4.1.4 to the cryptographic implications and countermeasures, and have excluded 
the detailed mathematical descriptions of the attacks. Finally, §4.1.5 considers two 
problems of cryptographic interest that are related to the ECDLP, namely the elliptic 
curve Diffie-Hellman problem (ECDHP) and the elliptic curve decision Diffie-Hellman 
problem (ECDDHP). 


4.1.1 Pohlig-Hellman attack 


The Pohlig-Hellman algorithm efficiently reduces the computation of l = log_P Q to
the computation of discrete logarithms in the prime order subgroups of ⟨P⟩. It follows
that the ECDLP in ⟨P⟩ is no harder than the ECDLP in its prime order subgroups.
Hence, in order to maximize resistance to the Pohlig-Hellman attack, the elliptic curve 
parameters should be selected so that the order n of P is divisible by a large prime. We 
now outline the Pohlig-Hellman algorithm. 

Suppose that the prime factorization of n is n = p_1^{e_1} p_2^{e_2} ··· p_r^{e_r}. The Pohlig-Hellman
strategy is to compute l_i = l mod p_i^{e_i} for each 1 ≤ i ≤ r, and then solve the system of
congruences

    l ≡ l_1 (mod p_1^{e_1})
    l ≡ l_2 (mod p_2^{e_2})
      ⋮
    l ≡ l_r (mod p_r^{e_r})

for l ∈ [0, n − 1]. (The Chinese Remainder Theorem guarantees a unique solution.) We
show how the computation of each l_i can be reduced to the computation of e_i discrete
logarithms in the subgroup of order p_i of ⟨P⟩. To simplify the notation, we write p for
p_i and e for e_i. Let the base-p representation of l_i be

    l_i = z_0 + z_1 p + z_2 p^2 + ··· + z_{e−1} p^{e−1}

where each z_i ∈ [0, p − 1]. The digits z_0, z_1, ..., z_{e−1} are computed one at a time as
follows. We first compute P_0 = (n/p)P and Q_0 = (n/p)Q. Since the order of P_0 is p,
we have

    Q_0 = (n/p)Q = l((n/p)P) = lP_0 = z_0 P_0.

Hence z_0 = log_{P_0} Q_0 can be obtained by solving an ECDLP instance in ⟨P_0⟩. Next, we
compute Q_1 = (n/p^2)(Q − z_0 P). We have

    Q_1 = (n/p^2)(Q − z_0 P) = (n/p^2)(l − z_0)P = (l − z_0)((n/p^2)P)
        = (z_0 + z_1 p − z_0)((n/p^2)P) = z_1((n/p)P) = z_1 P_0.

Hence z_1 = log_{P_0} Q_1 can be obtained by solving an ECDLP instance in ⟨P_0⟩. In
general, if the digits z_0, z_1, ..., z_{t−1} have been computed, then z_t = log_{P_0} Q_t, where

    Q_t = (n/p^{t+1})(Q − z_0 P − z_1 pP − z_2 p^2 P − ··· − z_{t−1} p^{t−1} P).

Example 4.2 (Pohlig-Hellman algorithm for solving the ECDLP) Consider the elliptic
curve $E$ defined over $\mathbb{F}_{7919}$ by the equation:

$$E : y^2 = x^3 + 1001x + 75.$$

Let $P = (4023, 6036) \in E(\mathbb{F}_{7919})$. The order of $P$ is

$$n = 7889 = 7^3 \cdot 23.$$

Let $Q = (4135, 3169) \in \langle P \rangle$. We wish to determine $l = \log_P Q$.

(i) We first determine $l_1 = l \bmod 7^3$. We write $l_1 = z_0 + z_1 7 + z_2 7^2$ and compute

$$P_0 = 7^2 \cdot 23\, P = (7801, 2071)$$
$$Q_0 = 7^2 \cdot 23\, Q = (7801, 2071)$$

and find that $Q_0 = P_0$; hence $z_0 = 1$. We next compute

$$Q_1 = 7 \cdot 23\,(Q - P) = (7285, 14)$$

and find that $Q_1 = 3P_0$; hence $z_1 = 3$. Finally, we compute

$$Q_2 = 23\,(Q - P - 3 \cdot 7P) = (7285, 7905)$$

and find that $Q_2 = 4P_0$; hence $z_2 = 4$. Thus $l_1 = 1 + 3 \cdot 7 + 4 \cdot 7^2 = 218$.

(ii) We next determine $l_2 = l \bmod 23$. We compute

$$P_0 = 7^3 P = (7190, 7003)$$
$$Q_0 = 7^3 Q = (2599, 759)$$

and find that $Q_0 = 10P_0$; hence $l_2 = 10$.

(iii) Finally, we solve the pair of congruences

$$l \equiv 218 \pmod{7^3}$$
$$l \equiv 10 \pmod{23}$$

and obtain $l = 4334$.
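
The steps of the Pohlig-Hellman outline can be sketched in a few lines of Python on the toy data of Example 4.2. This is only an illustrative sketch, not a production implementation: the helper names (`ec_add`, `ec_mul`, `small_dlog`, `pohlig_hellman`) are hypothetical, affine coordinates are used for simplicity, and the logarithms in the small prime-order subgroups are found by brute force, which is feasible only because the primes 7 and 23 are tiny. Python 3.8 or later is assumed for `pow(x, -1, m)`.

```python
# Toy Pohlig-Hellman on the curve of Example 4.2: y^2 = x^3 + 1001x + 75 over F_7919.
# Points are (x, y) tuples; None stands for the point at infinity.
p, a = 7919, 1001  # the coefficient b = 75 is not needed by the group law

def ec_neg(P):
    return None if P is None else (P[0], -P[1] % p)

def ec_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                              # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    """Double-and-add scalar multiplication for k >= 0."""
    R = None
    while k > 0:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

def small_dlog(P0, Q0, order):
    """Brute-force z in [0, order-1] with Q0 = z*P0 (fine for tiny prime orders)."""
    R = None
    for z in range(order):
        if R == Q0:
            return z
        R = ec_add(R, P0)
    raise ValueError("Q0 is not in the subgroup generated by P0")

def pohlig_hellman(P, Q, n, factors):
    """factors: list of (prime, exponent) pairs whose product of powers is n."""
    l = 0
    for q, e in factors:
        P0 = ec_mul(n // q, P)                   # generator of the subgroup of order q
        li, S = 0, Q                             # S = Q - (z_0 + z_1 q + ...)P so far
        for t in range(e):
            Qt = ec_mul(n // q ** (t + 1), S)
            z = small_dlog(P0, Qt, q)            # digit z_t = log_{P0} Q_t
            li += z * q ** t
            S = ec_add(S, ec_neg(ec_mul(z * q ** t, P)))
        m = q ** e                               # CRT: fold l = li (mod m) into the answer
        M = n // m
        l = (l + li * M * pow(M, -1, m)) % n
    return l

P, Q = (4023, 6036), (4135, 3169)
l = pohlig_hellman(P, Q, 7889, [(7, 3), (23, 1)])
print(l, ec_mul(l, P) == Q)
```

The final line prints the recovered logarithm together with a check that $lP = Q$; Example 4.2 reports $l = 4334$ for these inputs.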


For the remainder of §4.1, we will assume that the order n of P is prime. 


4.1.2 Pollard's rho attack


The main idea behind Pollard's rho algorithm is to find distinct pairs $(c', d')$ and
$(c'', d'')$ of integers modulo $n$ such that

$$c'P + d'Q = c''P + d''Q.$$

Then

$$(c' - c'')P = (d'' - d')Q = (d'' - d')lP$$

and so

$$(c' - c'') \equiv (d'' - d')l \pmod{n}.$$

Hence $l = \log_P Q$ can be obtained by computing

$$l = (c' - c'')(d'' - d')^{-1} \bmod n. \tag{4.1}$$


A naive method for finding such pairs $(c', d')$ and $(c'', d'')$ is to select random
integers $c, d \in [0, n-1]$ and store the triples $(c, d, cP + dQ)$ in a table sorted by third
component until a point $cP + dQ$ is obtained for a second time; such an occurrence is
called a collision. By the birthday paradox,* the expected number of iterations before
a collision is obtained is approximately $\sqrt{\pi n/2} \approx 1.2533\sqrt{n}$. The drawback of this
algorithm is the storage required for the $\sqrt{\pi n/2}$ triples.

Pollard's rho algorithm finds $(c', d')$ and $(c'', d'')$ in roughly the same expected time
as the naive method, but has negligible storage requirements. The idea is to define
an iterating function $f : \langle P \rangle \to \langle P \rangle$ so that given $X \in \langle P \rangle$ and $c, d \in [0, n-1]$ with
$X = cP + dQ$, it is easy to compute $\bar{X} = f(X)$ and $\bar{c}, \bar{d} \in [0, n-1]$ with $\bar{X} = \bar{c}P + \bar{d}Q$.
Furthermore, $f$ should have the characteristics of a random function.

The following is an example of a suitable iterating function. Let $\{S_1, S_2, \ldots, S_L\}$ be
a "random" partition of $\langle P \rangle$ into $L$ sets of roughly the same size. Typical values for the
number of branches $L$ are 16 and 32. For example, if $L = 32$ then a point $X \in \langle P \rangle$ can
be assigned to $S_j$ if the five least significant bits of the x-coordinate of $X$ represent the
integer $j - 1$. We write $H(X) = j$ if $X \in S_j$ and call $H$ the partition function. Finally,
let $a_j, b_j \in_R [0, n-1]$ for $1 \le j \le L$. Then $f : \langle P \rangle \to \langle P \rangle$ is defined by

$$f(X) = X + a_j P + b_j Q \quad \text{where } j = H(X).$$

Observe that if $X = cP + dQ$, then $f(X) = \bar{X} = \bar{c}P + \bar{d}Q$ where $\bar{c} = c + a_j \bmod n$
and $\bar{d} = d + b_j \bmod n$.

*Suppose that an urn has $n$ balls numbered 1 to $n$. The balls are randomly drawn, one at a time with
replacement, from the urn. Then the expected number of draws before some ball is drawn for the second time
is approximately $\sqrt{\pi n/2}$. If $n = 365$ and the balls represent different days of the year, then the statement
can be interpreted as saying that the expected number of people that have to be gathered in a room before
one expects at least two of them to have the same birthday is approximately $\sqrt{\pi 365/2} \approx 24$. This number is
surprisingly small and hence the nomenclature "birthday paradox."

Now, any point $X_0 \in \langle P \rangle$ determines a sequence $\{X_i\}_{i \ge 0}$ of points where $X_i =
f(X_{i-1})$ for $i \ge 1$. Since the set $\langle P \rangle$ is finite, the sequence will eventually collide
and then cycle forever; that is, there is a smallest index $t$ for which $X_t = X_{t+s}$ for
some $s \ge 1$, and then $X_i = X_{i-s}$ for all $i \ge t + s$ (see Figure 4.1). Here, $t$ is called
the tail length and $s$ is called the cycle length of the sequence. If $f$ is assumed to be
a random function, then the sequence is expected to first collide after approximately
$\sqrt{\pi n/2}$ terms. Moreover, the expected tail length is $t \approx \sqrt{\pi n/8}$ and the expected cycle
length is $s \approx \sqrt{\pi n/8}$.

Figure 4.1. $\rho$-like shape of the sequence $\{X_i\}$ in Pollard's rho algorithm, where $t$ = tail length
and $s$ = cycle length.

A collision, that is, points $X_i$, $X_j$ with $X_i = X_j$ and $i \ne j$, can be found using
Floyd's cycle-finding algorithm wherein one computes pairs $(X_i, X_{2i})$ of points for
$i = 1, 2, 3, \ldots$ until $X_i = X_{2i}$. After computing a new pair, the previous pair can be
discarded; thus the storage requirements are negligible. The expected number $k$ of such
pairs that have to be computed before $X_i = X_{2i}$ is easily seen to satisfy $t \le k \le t + s$.
In fact, assuming that $f$ is a random function, the expected value of $k$ is about
$1.0308\sqrt{n}$, and hence the expected number of elliptic curve group operations is about
$3\sqrt{n}$. The complete algorithm is presented as Algorithm 4.3. Note that the probability
of the algorithm terminating with failure (i.e., $d' = d''$ in step 7) is negligible.


Algorithm 4.3 Pollard's rho algorithm for the ECDLP (single processor)

INPUT: $P \in E(\mathbb{F}_q)$ of prime order $n$, $Q \in \langle P \rangle$.
OUTPUT: The discrete logarithm $l = \log_P Q$.
1. Select the number $L$ of branches (e.g., $L = 16$ or $L = 32$).
2. Select a partition function $H : \langle P \rangle \to \{1, 2, \ldots, L\}$.
3. For $j$ from 1 to $L$ do
   3.1 Select $a_j, b_j \in_R [0, n-1]$.
   3.2 Compute $R_j = a_j P + b_j Q$.
4. Select $c', d' \in_R [0, n-1]$ and compute $X' = c'P + d'Q$.
5. Set $X'' \leftarrow X'$, $c'' \leftarrow c'$, $d'' \leftarrow d'$.
6. Repeat the following:
   6.1 Compute $j = H(X')$.
       Set $X' \leftarrow X' + R_j$, $c' \leftarrow c' + a_j \bmod n$, $d' \leftarrow d' + b_j \bmod n$.
   6.2 For $i$ from 1 to 2 do
         Compute $j = H(X'')$.
         Set $X'' \leftarrow X'' + R_j$, $c'' \leftarrow c'' + a_j \bmod n$, $d'' \leftarrow d'' + b_j \bmod n$.
   Until $X' = X''$.
7. If $d' = d''$ then return("failure");
   Else compute $l = (c' - c'')(d'' - d')^{-1} \bmod n$ and return($l$).








Example 4.4 (Pollard's rho algorithm for solving the ECDLP) Consider the elliptic
curve defined over $\mathbb{F}_{229}$ by the equation:

$$E : y^2 = x^3 + x + 44.$$

The point $P = (5, 116) \in E(\mathbb{F}_{229})$ has prime order $n = 239$. Let $Q = (155, 166) \in \langle P \rangle$.
We wish to determine $l = \log_P Q$.
We select the partition function $H : \langle P \rangle \to \{1, 2, 3, 4\}$ with $L = 4$ branches:

$$H(x, y) = (x \bmod 4) + 1,$$

and the four triples

$$[a_1, b_1, R_1] = [79, 163, (135, 117)]$$
$$[a_2, b_2, R_2] = [206, 19, (96, 97)]$$
$$[a_3, b_3, R_3] = [87, 109, (84, 62)]$$
$$[a_4, b_4, R_4] = [219, 68, (72, 134)].$$

The following table lists the triples $(c', d', X')$ and $(c'', d'', X'')$ computed in Algorithm 4.3
for the case $(c', d') = (54, 175)$ in step 4.

    Iteration    c'    d'    X'            c''   d''   X''
        -        54   175   (39,159)       54   175   (39,159)
        1        34     4   (160,9)       113   167   (130,182)
        2       113   167   (130,182)     180   105   (36,97)
        3       200    37   (27,17)         0    97   (108,89)
        4       180   105   (36,97)        46    40   (223,153)
        5        20    29   (119,180)     232   127   (167,57)
        6         0    97   (108,89)      192    24   (57,105)
        7        79    21   (81,168)      139   111   (185,227)
        8        46    40   (223,153)     193     0   (197,92)
        9        26   108   (9,18)        140    87   (194,145)
       10       232   127   (167,57)       67   120   (223,153)
       11       212   195   (75,136)       14   207   (167,57)
       12       192    24   (57,105)      213   104   (57,105)

The algorithm finds

$$192P + 24Q = 213P + 104Q,$$

and hence

$$l = (192 - 213) \cdot (104 - 24)^{-1} \bmod 239 = 176.$$
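
Algorithm 4.3 can be sketched directly in Python using the parameters of Example 4.4. The code below is a minimal illustration under stated assumptions: the helper names are hypothetical, the point at infinity is represented by `None` and (as an arbitrary convention not taken from the text) is assigned to branch 1, and Python 3.8 or later is assumed for modular inverses via `pow(x, -1, n)`.

```python
# Algorithm 4.3 on the toy data of Example 4.4: y^2 = x^3 + x + 44 over F_229.
p, a, n = 229, 1, 239

def ec_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                              # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    R = None
    while k > 0:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

def H(X):
    """Partition function of Example 4.4: branch index in {1, 2, 3, 4}."""
    return 1 if X is None else X[0] % 4 + 1

def pollard_rho(P, Q, branches, c0, d0):
    """branches[j] = (a_j, b_j). Returns l = log_P Q, or None on failure (d' = d'')."""
    R = {j: ec_add(ec_mul(aj, P), ec_mul(bj, Q)) for j, (aj, bj) in branches.items()}

    def step(X, c, d):                           # one application of f, tracking (c, d)
        j = H(X)
        aj, bj = branches[j]
        return ec_add(X, R[j]), (c + aj) % n, (d + bj) % n

    slow = (ec_add(ec_mul(c0, P), ec_mul(d0, Q)), c0, d0)
    fast = slow
    while True:                                  # Floyd: compare X_i with X_{2i}
        slow = step(*slow)
        fast = step(*step(*fast))
        if slow[0] == fast[0]:
            break
    (_, c1, d1), (_, c2, d2) = slow, fast
    if d1 == d2:
        return None
    return (c1 - c2) * pow((d2 - d1) % n, -1, n) % n

branches = {1: (79, 163), 2: (206, 19), 3: (87, 109), 4: (219, 68)}
P, Q = (5, 116), (155, 166)
l = pollard_rho(P, Q, branches, 54, 175)
print(l, ec_mul(l, P) == Q)
```

Only two triples are held in memory at any time, which is the point of the cycle-finding approach; Example 4.4 reports $l = 176$ for this starting pair.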


Parallelized Pollard’s rho attack 


Suppose now that $M$ processors are available for solving an ECDLP instance. A naive
approach would be to run Pollard's rho algorithm independently on each processor
(with different randomly chosen starting points $X_0$) until any one processor terminates.
A careful analysis shows that the expected number of elliptic curve operations
performed by each processor before one terminates is about $3\sqrt{n/M}$. Thus the expected
speedup is only by a factor of $\sqrt{M}$.

Van Oorschot and Wiener proposed a variant of Pollard's rho algorithm that yields a
factor $M$ speedup when $M$ processors are employed. The idea is to allow the sequences
$\{X_i\}_{i \ge 0}$ generated by the processors to collide with one another. More precisely, each
processor randomly selects its own starting point $X_0$, but all processors use the same
iterating function $f$ to compute subsequent points $X_i$. Thus, if the sequences from two
different processors ever collide, then, as illustrated in Figure 4.2, the two sequences
will be identical from that point on.

Floyd's cycle-finding algorithm finds a collision in the sequence generated by a single
processor. The following strategy enables efficient finding of a collision in the
sequences generated by different processors. An easily testable distinguishing property
of points is selected. For example, a point may be distinguished if the leading $t$ bits of
its x-coordinate are zero. Let $\theta$ be the proportion of points in $\langle P \rangle$ having this
distinguishing property. Whenever a processor encounters a distinguished point, it transmits
the point to a central server which stores it in a sorted list. When the server receives the
same distinguished point for the second time, it computes the desired discrete logarithm
via (4.1) and terminates all processors. The expected number of steps per processor
before a collision occurs is $(\sqrt{\pi n/2})/M$. A subsequent distinguished point is expected
after $1/\theta$ steps. Hence the expected number of elliptic curve operations performed by
each processor before a collision of distinguished points is observed is

$$\frac{1}{M}\sqrt{\frac{\pi n}{2}} + \frac{1}{\theta}, \tag{4.2}$$

and this parallelized version of Pollard's rho algorithm achieves a speedup that is linear
in the number of processors employed. Observe that the processors do not have
to communicate with each other, and furthermore have limited communications with
the central server. Moreover, the total space requirements at the central server can be
controlled by careful selection of the distinguishing property. The complete algorithm
is presented as Algorithm 4.5. Note that the probability of the algorithm terminating
with failure (i.e., $d' = d''$ in step 7) is negligible.


Algorithm 4.5 Parallelized Pollard's rho algorithm for the ECDLP

INPUT: $P \in E(\mathbb{F}_q)$ of prime order $n$, $Q \in \langle P \rangle$.
OUTPUT: The discrete logarithm $l = \log_P Q$.
1. Select the number $L$ of branches (e.g., $L = 16$ or $L = 32$).
2. Select a partition function $H : \langle P \rangle \to \{1, 2, \ldots, L\}$.
3. Select a distinguishing property for points in $\langle P \rangle$.
4. For $j$ from 1 to $L$ do
   4.1 Select $a_j, b_j \in_R [0, n-1]$.
   4.2 Compute $R_j = a_j P + b_j Q$.
5. Each of the $M$ processors does the following:
   5.1 Select $c, d \in_R [0, n-1]$ and compute $X = cP + dQ$.
   5.2 Repeat the following:
         If $X$ is distinguished then send $(c, d, X)$ to the central server.
         Compute $j = H(X)$.
         Set $X \leftarrow X + R_j$, $c \leftarrow c + a_j \bmod n$, and $d \leftarrow d + b_j \bmod n$.
       Until the server receives some distinguished point $Y$ for the second time.
6. Let the two triples associated with $Y$ be $(c', d', Y)$ and $(c'', d'', Y)$.
7. If $d' = d''$ then return("failure");
   Else compute $l = (c' - c'')(d'' - d')^{-1} \bmod n$ and return($l$).
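
The distinguished-point strategy of Algorithm 4.5 can be simulated in a single Python process by advancing $M$ walks in round-robin fashion and letting a dictionary play the role of the central server. This is a toy sketch under several assumptions that are not part of the text: the curve and points are borrowed from Example 4.4, the distinguishing property "x-coordinate divisible by 8" and the partition function `x mod 4` are arbitrary illustrative choices, and walks are restarted after 100 steps as a safeguard against a walk trapped in a short cycle that contains no distinguished point (an artifact of the tiny group size). All function names are hypothetical.

```python
import random

p, a, n = 229, 1, 239        # toy curve y^2 = x^3 + x + 44 over F_229, point order n

def ec_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    R = None
    while k > 0:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

def parallel_rho(P, Q, M=4, L=4, seed=1):
    rng = random.Random(seed)
    branches = [(rng.randrange(n), rng.randrange(n)) for _ in range(L)]
    R = [ec_add(ec_mul(aj, P), ec_mul(bj, Q)) for aj, bj in branches]

    def distinguished(X):                 # illustrative property, theta is roughly 1/8
        return X is not None and X[0] % 8 == 0

    def fresh_walk():                     # [X, c, d, steps taken] with X = cP + dQ
        c, d = rng.randrange(n), rng.randrange(n)
        return [ec_add(ec_mul(c, P), ec_mul(d, Q)), c, d, 0]

    server = {}                           # distinguished point -> (c, d)
    walks = [fresh_walk() for _ in range(M)]
    while True:
        for w in walks:                   # one step of each simulated processor
            X, c, d, steps = w
            if distinguished(X):
                if X in server and server[X][1] != d:
                    c2, d2 = server[X]    # same point, two representations: apply (4.1)
                    return (c - c2) * pow((d2 - d) % n, -1, n) % n
                server[X] = (c, d)
            if X is None or steps > 100:  # toy safeguard: restart a stuck walk
                w[:] = fresh_walk()
                continue
            j = X[0] % L
            aj, bj = branches[j]
            w[:] = [ec_add(X, R[j]), (c + aj) % n, (d + bj) % n, steps + 1]

P, Q = (5, 116), (155, 166)
l = parallel_rho(P, Q)
print(l, ec_mul(l, P) == Q)
```

In a real deployment the walks would run on separate machines and only the occasional distinguished triple would cross the network, which is why the communication cost stays low.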





Figure 4.2. Sequences generated by the parallelized Pollard's rho algorithm. The sequences
generated by processors 3 and 4 first collide at $X$. The algorithm reports the collision at $Y$, the first
subsequent distinguished point.


Speeding Pollard's rho algorithm using automorphisms


Let $\psi : \langle P \rangle \to \langle P \rangle$ be a group automorphism, where $P \in E(\mathbb{F}_q)$ has order $n$. We
assume that $\psi$ can be computed very efficiently, significantly faster than a point
addition. Suppose that $\psi$ has order $t$, that is, $t$ is the smallest positive integer such that
$\psi^t(R) = R$ for all $R \in \langle P \rangle$. The relation $\sim$ on $\langle P \rangle$ defined by

$$R_1 \sim R_2 \text{ if and only if } R_1 = \psi^j(R_2) \text{ for some } j \in [0, t-1]$$

is an equivalence relation. The equivalence class $[R]$ containing a point $R \in \langle P \rangle$ is

$$[R] = \{R, \psi(R), \psi^2(R), \ldots, \psi^{l-1}(R)\},$$

where $l$ is the smallest positive divisor of $t$ such that $\psi^l(R) = R$.

The idea behind the speedup is to modify the iterating function $f$ so that it is defined
on the equivalence classes (rather than just on the points in $\langle P \rangle$). To achieve this, we
define a canonical representative $\bar{R}$ for each equivalence class $[R]$. For example, $\bar{R}$ may
be defined to be the point in $[R]$ whose x-coordinate is the smallest when considered as
an integer (with ties broken by selecting the point with a smaller y-coordinate). Then,
we can define an iterating function $g$ on the canonical representatives by

$$g(\bar{R}) = \overline{f(\bar{R})}.$$

Suppose now that we know the integer $\lambda \in [0, n-1]$ such that

$$\psi(P) = \lambda P.$$

Then, since $\psi$ is a group automorphism, we have that $\psi(R) = \lambda R$ for all $R \in \langle P \rangle$.
Thus, if we know integers $a$ and $b$ such that $X = aP + bQ$, then we can efficiently
compute integers $a'$ and $b'$ such that $\bar{X} = a'P + b'Q$. Namely, if $\bar{X} = \psi^j(X)$, then
$a' = \lambda^j a \bmod n$ and $b' = \lambda^j b \bmod n$.

The function $g$ can now be used as the iterating function in the parallelized Pollard's
rho algorithm. The initial point in a sequence is $X'_0 = \bar{X}_0$ where $X_0 = a_0 P + b_0 Q$
and $a_0, b_0 \in_R [0, n-1]$. Subsequent terms of the sequence are computed iteratively:
$X'_i = g(X'_{i-1})$ for $i \ge 1$. If most equivalence classes have size $t$, then the search space
has size approximately $n/t$ (versus $n$ if equivalence classes are not employed) and thus
the expected running time of the modified parallelized Pollard's rho algorithm is

$$\frac{1}{M}\sqrt{\frac{\pi n}{2t}} + \frac{1}{\theta}, \tag{4.3}$$

a speedup by a factor of $\sqrt{t}$ over (4.2).


Example 4.6 (using the negation map) The negation map $\psi(P) = -P$ has order 2 and
possesses the requisite properties described above. Thus, the parallelized Pollard's rho
algorithm that uses equivalence classes under the negation map has an expected running
time of

$$\frac{\sqrt{\pi n}}{2M} + \frac{1}{\theta}. \tag{4.4}$$

This is a speedup by a factor of $\sqrt{2}$ over (4.2) and is applicable to all elliptic curves.


Example 4.7 (speeding Pollard's rho algorithm for Koblitz curves) Recall from §3.4
that a Koblitz curve $E_a$ (where $a \in \{0, 1\}$) is an elliptic curve defined over $\mathbb{F}_2$. The
Frobenius map $\tau : E_a(\mathbb{F}_{2^m}) \to E_a(\mathbb{F}_{2^m})$, defined by $\tau(\infty) = \infty$ and $\tau(x, y) = (x^2, y^2)$,
is also a group automorphism of order $m$ and can be computed efficiently since squaring
is a cheap operation in $\mathbb{F}_{2^m}$. If $P \in E_a(\mathbb{F}_{2^m})$ has prime order $n$ such that $n^2$ does
not divide $\#E_a(\mathbb{F}_{2^m})$, then $\tau(P) \in \langle P \rangle$ and hence $\tau$ is also a group automorphism of
$\langle P \rangle$. Let $\mu = (-1)^{1-a}$. It follows from Note 3.72 that one of the two solutions $\lambda$ to the
modular equation

$$\lambda^2 - \mu\lambda + 2 \equiv 0 \pmod{n}$$

satisfies $\tau(P) = \lambda P$. Thus, $\tau$ has the requisite properties, and the parallelized Pollard's
rho algorithm that uses equivalence classes under the Frobenius map has an expected
running time of

$$\frac{1}{M}\sqrt{\frac{\pi n}{2m}} + \frac{1}{\theta}.$$

Furthermore, the parallelized Pollard's rho algorithm can exploit both the Frobenius
map and the negation map to achieve an expected running time of

$$\frac{1}{2M}\sqrt{\frac{\pi n}{m}} + \frac{1}{\theta} \tag{4.5}$$

for Koblitz curves, a speedup by a factor of $\sqrt{2m}$ over (4.2).
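
As a quick consistency check on the speedup factors quoted for the automorphism variants, the following short derivation compares only the dominant square-root terms of the running-time estimates and ignores the additive $1/\theta$ term (an assumption that is reasonable when $1/\theta$ is small relative to the first term). Here $t$ denotes the order of the automorphism group being exploited.

```latex
\[
\frac{\frac{1}{M}\sqrt{\pi n/2}}{\frac{1}{M}\sqrt{\pi n/(2t)}} = \sqrt{t},
\qquad\text{so}\qquad
t = 2 \;\Rightarrow\; \sqrt{2},\quad
t = m \;\Rightarrow\; \sqrt{m},\quad
t = 2m \;\Rightarrow\; \sqrt{2m}.
\]
\[
\text{With } t = 2m:\quad
\frac{1}{M}\sqrt{\frac{\pi n}{2 \cdot 2m}} = \frac{1}{2M}\sqrt{\frac{\pi n}{m}} .
\]
```

For instance, for a Koblitz curve over a field with $m = 163$, the combined factor $\sqrt{2m}$ is about 18, that is, a saving of roughly four bits of security rather than a qualitative change in difficulty.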
Example 4.8 (solving a 113-bit ECDLP instance on the Internet) Let $E$ be an elliptic
curve defined over a prime field $\mathbb{F}_p$, and let $P \in E(\mathbb{F}_p)$ have prime order $n$. Suppose
also that both $p$ and $n$ are 113-bit primes. Elliptic curves with these parameters would
offer roughly the same security as provided by 56-bit DES. Assume that we have $M =
10{,}000$ computers available on the Internet to solve an instance of the ECDLP in $\langle P \rangle$,
and that each computer can perform one iteration (of step 5.2 of Algorithm 4.5) in
10 microseconds. If we select the distinguishing property so that $\theta = 2^{-30}$, then the
expected number of iterations performed by each computer before the logarithm is
found is approximately

$$\frac{\sqrt{\pi 2^{113}}}{2 \cdot 10000} + 2^{30} \approx 9.03 \times 10^{12}.$$

Hence, the expected running time before the logarithm is found is about 1045 days, or
three years. Since the x-coordinate and associated $(c, d)$ pair of a distinguished point
can be stored in 12 32-bit words, the total space required for storing the distinguished
points at the central server is about

$$12\,\theta\,\frac{\sqrt{\pi n}}{2} \text{ words} \approx 3.8 \text{ Gigabytes}.$$

One concludes from these calculations that while solving a 113-bit ECDLP requires
significant resources, 113-bit ECC provides adequate security only for low-security
short-term applications.
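
The figures in Example 4.8 can be reproduced with a few lines of arithmetic. The sketch below assumes that the negation-map estimate (4.4), $\sqrt{\pi n}/(2M) + 1/\theta$, is the one being applied, that $n$ is approximated by $2^{113}$, and that $\theta = 2^{-30}$; these assumptions are what reproduce the quoted running time of about 1045 days and storage of about 3.8 Gigabytes. The variable names are illustrative.

```python
import math

n = 2 ** 113                 # group order, approximated by 2^113 (assumption)
M = 10_000                   # number of computers
theta = 2.0 ** -30           # proportion of distinguished points (assumption)
iteration_time = 10e-6       # seconds per iteration

# Expected iterations per computer, using the negation-map estimate (4.4).
iterations = math.sqrt(math.pi * n) / (2 * M) + 1 / theta
days = iterations * iteration_time / 86_400

# Distinguished points reaching the server: total steps over all machines times theta.
points = theta * math.sqrt(math.pi * n) / 2
gib = points * 12 * 4 / 2 ** 30          # 12 words of 4 bytes per stored point

print(f"{iterations:.3e} iterations, {days:.0f} days, {gib:.2f} GiB")
```

The dominant cost is the square-root term; the $1/\theta$ term contributes only about $10^9$ of the roughly $9 \times 10^{12}$ iterations.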


Multiple logarithms 


We show how the distinguished points stored during the solution of one ECDLP
instance in $\langle P \rangle$ using (parallelized) Pollard's rho algorithm can be used to accelerate
the solution of other ECDLP instances in $\langle P \rangle$. This property is relevant to the
security of elliptic curve cryptographic systems because users typically share elliptic curve
parameters $E$, $\mathbb{F}_q$, $P$, and select their own public keys $Q \in \langle P \rangle$. Thus, if one or more
private keys can be found using Pollard's rho algorithm, then finding other private keys
becomes progressively easier.

Suppose that $l = \log_P Q$ has been computed. For each stored triple $(c, d, X)$
associated to distinguished points $X$ encountered during the computation, the integer
$s = c + dl \bmod n$ satisfies $X = sP$. Similarly, the integers $r_j = a_j + b_j l \bmod n$ satisfy
$R_j = r_j P$ for $1 \le j \le L$. Now, to compute $l' = \log_P Q'$ where $Q' \in \langle P \rangle$, each
processor computes the terms $Y_i$ of a random sequence with starting point $Y_0 = c'_0 P + d'_0 Q'$
where $c'_0, d'_0 \in_R [0, n-1]$, and the same iterating function $f$ as before. For each
distinguished point $Y$ encountered in the new sequences, a triple $(c', d', Y)$ such that
$Y = c'P + d'Q'$ is sent to the central server. A collision can occur between two new
sequences or between a new sequence and an old one. In the former case, we have

$$c'P + d'Q' = c''P + d''Q',$$

whence $l' = (c' - c'')(d'' - d')^{-1} \bmod n$. In the latter case, we have

$$c'P + d'Q' = sP,$$

whence $l' = (s - c')(d')^{-1} \bmod n$.

The distinguished points collected during the first two ECDLP computations can
similarly be used for the computation of the third ECDLP computation, and so on. The
expected number $W_k$ of random walk steps before $k$ ECDLP instances are iteratively
solved in the manner described has been shown to be

$$W_k \approx T \sum_{i=0}^{k-1} \frac{\binom{2i}{i}}{4^i},$$

where $T$ is the expected number of random walk steps to solve a single ECDLP
instance. Thus, solving the second, third, and fourth ECDLP instances takes only 50%,
37%, and 31%, respectively, of the time to solve the first instance.
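
The quoted percentages can be checked numerically. The sketch below assumes the closed form $W_k \approx T \sum_{i=0}^{k-1} \binom{2i}{i}/4^i$ for the cumulative cost, so that the $(i+1)$-th instance costs a fraction $\binom{2i}{i}/4^i$ of $T$; the function name is hypothetical.

```python
from math import comb

def marginal_cost(i):
    """Cost of the (i+1)-th logarithm as a fraction of T, under the assumed closed form."""
    return comb(2 * i, i) / 4 ** i

ratios = [marginal_cost(i) for i in range(4)]
print(ratios)   # fractions of T for the first four instances
```

The fractions decrease slowly (on the order of $1/\sqrt{\pi i}$ for large $i$), so later instances get cheaper but never free.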

Concerns that successive ECDLP computations become easier can be addressed by 
ensuring that the elliptic curve parameters are chosen so that the first ECDLP instance 
is infeasible to solve. 


4.1.3 Index-calculus attacks


Index-calculus algorithms are the most powerful methods known for computing
discrete logarithms in some groups including the multiplicative group $\mathbb{F}_q^*$ of a finite field,
the jacobian $J_C(\mathbb{F}_q)$ of a hyperelliptic curve $C$ of high genus $g$ defined over a finite
field $\mathbb{F}_q$, and the class group of an imaginary quadratic number field. It is natural then
to ask whether index-calculus methods can lead to subexponential-time algorithms for
the ECDLP.

We begin by outlining the index-calculus method in the general setting of an arbitrary 
cyclic group and illustrate how the method can be adapted to the multiplicative group of 
a prime field or binary field. We then explain why the natural ways to extend the index- 
calculus methods to elliptic curve groups are highly unlikely to yield subexponential- 
time algorithms for the ECDLP. 


The main idea behind index-calculus methods 


Let $G$ be a cyclic group of order $n$ generated by $\alpha$. Suppose that we wish to find $\log_\alpha \beta$
for $\beta \in G$. The index-calculus method is the following.

1. Factor base selection. Choose a subset $S = \{p_1, p_2, \ldots, p_t\}$ of $G$, called the
   factor base, such that a "significant" proportion of elements in $G$ can be efficiently
   expressed as a product of elements from $S$. The choice of $S$ will depend on the
   characteristics of the particular group $G$.

2. Compute logarithms of elements in $S$. Select random integers $k \in [0, n-1]$ until
   $\alpha^k$ can be written as a product of elements in $S$:

   $$\alpha^k = \prod_{i=1}^{t} p_i^{c_i}, \quad \text{where } c_i \ge 0. \tag{4.6}$$

   Taking logarithms to the base $\alpha$ of both sides of (4.6) yields a linear equation
   where the unknowns are the logarithms of factor base elements:

   $$k \equiv \sum_{i=1}^{t} c_i \log_\alpha p_i \pmod{n}. \tag{4.7}$$

   This procedure is repeated until slightly more than $t$ such equations have been
   obtained. The resulting linear system of equations can then be solved to obtain
   $\log_\alpha p_i$ for $1 \le i \le t$.

3. Compute $\log_\alpha \beta$. Select random integers $k$ until $\alpha^k \beta$ can be written as a product
   of elements in $S$:

   $$\alpha^k \beta = \prod_{i=1}^{t} p_i^{d_i}, \quad \text{where } d_i \ge 0. \tag{4.8}$$

   Taking logarithms to the base $\alpha$ of both sides of (4.8) yields the desired logarithm
   of $\beta$:

   $$\log_\alpha \beta = \left(-k + \sum_{i=1}^{t} d_i \log_\alpha p_i\right) \bmod n. \tag{4.9}$$

The running time of the index-calculus algorithm depends critically on the choice
of the factor base $S$. There is also a trade-off in the size $t$ of $S$. Larger $t$ are preferred
because then the probability of a random group element factoring over $S$ is expected to
be larger. On the other hand, smaller $t$ are preferred because then the number of linear
equations that need to be collected is smaller. The optimum choice of $t$ depends on the
proportion of elements in $G$ that factor over $S$.
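
The relation-collection idea behind steps 2 and 3 can be illustrated in the multiplicative group of a small prime field. The sketch below is deliberately simplified and rests on several assumptions: the prime $p = 1019$, the factor base $\{2, 3, 5, 7\}$, and the target $\beta = 777$ are arbitrary illustrative choices; exponents $k$ are tried in increasing order rather than at random; and, instead of solving the linear system from (4.7), the factor-base logarithms are read from a brute-force table, which is used only to check that each collected relation is consistent. Python 3.8 or later is assumed.

```python
def factor_over(base, x):
    """Return exponents e with x = prod(base[i]**e[i]), or None if x is not smooth."""
    exps = []
    for q in base:
        e = 0
        while x % q == 0:
            x //= q
            e += 1
        exps.append(e)
    return exps if x == 1 else None

p = 1019                      # toy prime (illustrative choice)
n = p - 1                     # order of the multiplicative group

def log_table(g):
    """Map g^k mod p -> k for k in [0, n-1] (brute force, toy sizes only)."""
    table, x = {}, 1
    for k in range(n):
        table[x] = k
        x = x * g % p
    return table

for alpha in range(2, p):     # smallest generator of the group
    logs = log_table(alpha)
    if len(logs) == n:
        break

base = [2, 3, 5, 7]           # factor base: the primes up to B = 7

# Step 2 (relation collection): find k with alpha^k smooth over the factor base.
relations, k = [], 0
while len(relations) < 6:
    k += 1
    exps = factor_over(base, pow(alpha, k, p))
    if exps is not None:
        relations.append((k, exps))

# Each relation must satisfy (4.7): k = sum c_i log(p_i) (mod n).
relations_ok = all(k % n == sum(c * logs[q] for c, q in zip(exps, base)) % n
                   for k, exps in relations)

# Step 3: find k with alpha^k * beta smooth, then apply (4.9).
beta, k = 777, 0
while (exps := factor_over(base, pow(alpha, k, p) * beta % p)) is None:
    k += 1
log_beta = (-k + sum(d * logs[q] for d, q in zip(exps, base))) % n

print(relations_ok, pow(alpha, log_beta, p) == beta)
```

In a genuine index-calculus computation the table `logs` is of course unavailable; the factor-base logarithms come from linear algebra modulo $n$ on the collected relations, and that step dominates the engineering effort.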

Consider now the case $G = \mathbb{F}_p^*$, the multiplicative group of a prime field. The
elements of $\mathbb{F}_p^*$ can be regarded as the integers in $[1, p-1]$. There is a natural choice for $S$,
namely the prime numbers $\le B$ for some bound $B$. An element of $\mathbb{F}_p^*$ factors over $S$ if it
is $B$-smooth, that is, all its prime factors are $\le B$. The optimal factor base size depends
on the distribution of $B$-smooth integers in $[1, p-1]$, and yields a subexponential-time
algorithm for the DLP in $\mathbb{F}_p^*$. The fastest variant of this algorithm is the number field
sieve (NFS) and has an expected running time of $L_p[\frac{1}{3}, 1.923]$.

Consider next the case $G = \mathbb{F}_{2^m}^*$, the multiplicative group of a binary field. The
elements of $\mathbb{F}_{2^m}^*$ can be regarded as the nonzero binary polynomials of degree less than
$m$. Hence there is a natural choice for $S$, namely the irreducible binary polynomials of
degree $\le B$ for some bound $B$. An element of $\mathbb{F}_{2^m}^*$ factors over $S$ if it is $B$-smooth,
that is, all its irreducible factors have degree $\le B$. The optimal factor base size depends
on the distribution of $B$-smooth polynomials among the binary polynomials of degree
less than $m$, and yields a subexponential-time algorithm for the DLP in $\mathbb{F}_{2^m}^*$. The fastest
variant of this algorithm is Coppersmith's algorithm and has an expected running time of
$L_{2^m}[\frac{1}{3}, c]$ for some constant $c < 1.587$.


Failure of index-calculus attacks on the ECDLP


Suppose that we wish to solve instances of the ECDLP in $E(\mathbb{F}_p)$ where $E : y^2 =
x^3 + ax + b$ is an elliptic curve defined over the prime field $\mathbb{F}_p$. For simplicity, suppose
that $E(\mathbb{F}_p)$ has prime order so that $E(\mathbb{F}_p) = \langle P \rangle$ for some $P \in E(\mathbb{F}_p)$. The most natural
index-calculus approach would first lift $E$ to a curve $\tilde{E}$ defined over the field $\mathbb{Q}$ of
rational numbers, that is, to a curve $\tilde{E} : y^2 = x^3 + \tilde{a}x + \tilde{b}$ where $\tilde{a}, \tilde{b} \in \mathbb{Q}$ and $a =
\tilde{a} \bmod p$ and $b = \tilde{b} \bmod p$. Then, the lift of a point $R \in E(\mathbb{F}_p)$ is a point $\tilde{R} \in \tilde{E}(\mathbb{Q})$
whose coordinates reduce modulo $p$ to those of $R$. This lifting process is analogous
to the ones used in the index-calculus method described above for computing discrete
logarithms in $\mathbb{F}_p^*$ and $\mathbb{F}_{2^m}^*$, where elements of $\mathbb{F}_p^*$ are "lifted" to integers in $\mathbb{Z}$, and
elements of $\mathbb{F}_{2^m}^*$ are "lifted" to polynomials in $\mathbb{F}_2[z]$.

The celebrated Mordell-Weil Theorem states that the group structure of $\tilde{E}(\mathbb{Q})$ is
$E_{\mathrm{tors}} \times \mathbb{Z}^r$, where $E_{\mathrm{tors}}$ is the set of points in $\tilde{E}(\mathbb{Q})$ of finite order, and $r$ is a
non-negative integer called the rank of $\tilde{E}$. Furthermore, a theorem of Mazur states that
$E_{\mathrm{tors}}$ has small size; in fact $\#E_{\mathrm{tors}} \le 16$. Thus a natural choice for the factor base is a
set of points $P_1, P_2, \ldots, P_r$ such that $\tilde{P}_1, \tilde{P}_2, \ldots, \tilde{P}_r$ are linearly independent in $\tilde{E}(\mathbb{Q})$.
Relations of the form (4.6) can then be found by selecting multiples $kP$ of $P$ in $E(\mathbb{F}_p)$
until the lift $\widetilde{kP}$ can be written as an integer linear combination of the basis points in
$\tilde{E}(\mathbb{Q})$:

$$\widetilde{kP} = c_1 \tilde{P}_1 + c_2 \tilde{P}_2 + \cdots + c_r \tilde{P}_r.$$

Then, reducing the coordinates of the points modulo $p$ yields a desired relation

$$kP = c_1 P_1 + c_2 P_2 + \cdots + c_r P_r$$

in $E(\mathbb{F}_p)$.

There are two main reasons why this index-calculus approach is doomed to fail. The
first is that no one knows how to efficiently lift points in $E(\mathbb{F}_p)$ to $\tilde{E}(\mathbb{Q})$. Certainly, for
a lifting procedure to be feasible, the lifted points should have small height. (Roughly
speaking, the height of a point $\tilde{P} \in \tilde{E}(\mathbb{Q})$ is the number of bits needed to write down the
coordinates of $\tilde{P}$.) However, it has been proven (under some reasonable assumptions)
that the number of points of small height in any elliptic curve $\tilde{E}(\mathbb{Q})$ is extremely small,
so that only an insignificant proportion of points in $E(\mathbb{F}_p)$ can possibly be lifted to
points of small height in $\tilde{E}(\mathbb{Q})$. This is the second reason for unavoidable failure of
this index-calculus approach.

For the ECDLP in elliptic curves $E$ over non-prime fields $\mathbb{F}_q$, one could consider
lifting $E$ to an elliptic curve over a number field, or to an elliptic curve over a function
field. These approaches are also destined to fail for the same reasons as for the prime
field case.

Of course there may be other ways of applying the index-calculus methodology
for solving the ECDLP. Thus far, no one has found an approach that yields a general
subexponential-time (or better) algorithm for the ECDLP.


4.1.4 Isomorphism attacks 


Let $E$ be an elliptic curve defined over a finite field $\mathbb{F}_q$, and let $P \in E(\mathbb{F}_q)$ have prime
order $n$. Let $G$ be a group of order $n$. Since $n$ is prime, $\langle P \rangle$ and $G$ are both cyclic and
hence isomorphic. If one could efficiently compute an isomorphism

$$\psi : \langle P \rangle \to G, \tag{4.10}$$

then ECDLP instances in $\langle P \rangle$ could be efficiently reduced to instances of the DLP in
$G$. Namely, given $P$ and $Q \in \langle P \rangle$, we have

$$\log_P Q = \log_{\psi(P)} \psi(Q). \tag{4.11}$$

Isomorphism attacks reduce the ECDLP to the DLP in groups $G$ for which
subexponential-time (or faster) algorithms are known. These attacks are special-purpose
in that they result in ECDLP solvers that are faster than Pollard's rho algorithm
only for special classes of elliptic curves. The isomorphism attacks that have been
devised are the following:

(i) The attack on prime-field-anomalous curves reduces the ECDLP in an elliptic
    curve of order $p$ defined over the prime field $\mathbb{F}_p$ to the DLP in the additive group
    $\mathbb{F}_p^+$ of integers modulo $p$.

(ii) In the case $\gcd(n, q) = 1$, the Weil and Tate pairing attacks establish an isomorphism
    between $\langle P \rangle$ and a subgroup of order $n$ of the multiplicative group $\mathbb{F}_{q^k}^*$ of
    some extension field $\mathbb{F}_{q^k}$.

(iii) The GHS Weil descent attack attempts to reduce the ECDLP in an elliptic curve
    defined over a binary field $\mathbb{F}_{2^m}$ to the DLP in the jacobian of a hyperelliptic curve
    defined over a proper subfield of $\mathbb{F}_{2^m}$.

Since a polynomial-time algorithm is known for solving the DLP in $\mathbb{F}_p^+$, and since
subexponential-time algorithms are known for the DLP in the multiplicative group of a
finite field and for the jacobian of high-genus hyperelliptic curves, these isomorphism
attacks can have important implications to the security of elliptic curve cryptographic
schemes. We next discuss the cryptographic implications of and countermeasures to
these attacks.


Attack on prime-field-anomalous curves 


An elliptic curve $E$ defined over a prime field $\mathbb{F}_p$ is said to be prime-field-anomalous
if $\#E(\mathbb{F}_p) = p$. The group $E(\mathbb{F}_p)$ is cyclic since it has prime order, and hence $E(\mathbb{F}_p)$
is isomorphic to the additive group $\mathbb{F}_p^+$ of integers modulo $p$. Now, the DLP in $\mathbb{F}_p^+$ is
the following: given $p$, $a \in \mathbb{F}_p^+$, $a \ne 0$, and $b \in \mathbb{F}_p^+$, find $l \in [0, p-1]$ such that $la \equiv b
\pmod{p}$. Since $l = ba^{-1} \bmod p$, the DLP in $\mathbb{F}_p^+$ can be efficiently solved by using the
extended Euclidean algorithm (Algorithm 2.20) to compute $a^{-1} \bmod p$.
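
To make concrete why the DLP in the additive group is easy, the following sketch solves $la \equiv b \pmod{p}$ with a textbook extended Euclidean algorithm. It is an illustration only: the function names are hypothetical, the routine is a generic version rather than a transcription of Algorithm 2.20, and the sample values of $p$, $a$ and $b$ are arbitrary.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0

def additive_dlog(a, b, p):
    """Solve l*a = b (mod p) for prime p and a != 0 (mod p): l = b * a^{-1} mod p."""
    g, inv_a, _ = ext_gcd(a % p, p)
    if g != 1:
        raise ValueError("a must be nonzero modulo the prime p")
    return b * inv_a % p

p, a, b = 7919, 1001, 75      # arbitrary sample values
l = additive_dlog(a, b, p)
print(l, l * a % p == b)
```

The work is a single modular inversion, polynomial in the bit length of $p$, which is what makes an efficiently computable isomorphism onto this group fatal for the corresponding curves.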


In 1997, Araki, Satoh, Semaev and Smart showed that an isomorphism

$$\psi : E(\mathbb{F}_p) \to \mathbb{F}_p^+$$

can be efficiently computed for prime-field-anomalous elliptic curves. Consequently,
the ECDLP in such curves can be efficiently solved and hence these elliptic curves
must not be used in cryptographic protocols. Since it is easy to determine whether an
elliptic curve $E$ over a prime field $\mathbb{F}_p$ is prime-field-anomalous (by checking whether
$\#E(\mathbb{F}_p) = p$), the Araki-Satoh-Semaev-Smart attack can easily be circumvented in
practice.


Weil and Tate pairing attacks 


Suppose now that the prime order $n$ of $P \in E(\mathbb{F}_q)$ satisfies $\gcd(n, q) = 1$. Let $k$ be the
smallest positive integer such that $q^k \equiv 1 \pmod{n}$; the integer $k$ is the multiplicative
order of $q$ modulo $n$ and therefore is a divisor of $n - 1$. Since $n$ divides $q^k - 1$, the
multiplicative group $\mathbb{F}_{q^k}^*$ of the extension field $\mathbb{F}_{q^k}$ has a unique subgroup $G$ of order $n$.
The Weil pairing attack constructs an isomorphism from $\langle P \rangle$ to $G$ when the additional
constraint $n \nmid (q - 1)$ is satisfied, while the Tate pairing attack constructs an isomorphism
between $\langle P \rangle$ and $G$ without requiring this additional constraint. The integer $k$ is
called the embedding degree.

For most elliptic curves one expects that $k \approx n$. In this case the Weil and Tate pairing
attacks do not yield an efficient ECDLP solver since the finite field $\mathbb{F}_{q^k}$ has exponential
size relative to the size of the ECDLP parameters. (The ECDLP parameters have size
$O(\log q)$ bits, while elements of $\mathbb{F}_{q^k}$ have size $O(k \log q)$ bits.) However, some special
elliptic curves do have small embedding degrees $k$. For these curves, the Weil and
Tate pairing reductions take polynomial time. Since subexponential-time algorithms
are known for the DLP in $\mathbb{F}_{q^k}^*$, this results in a subexponential-time algorithm for the
ECDLP in these special elliptic curves.

The special classes of elliptic curves with small embedding degree include
supersingular curves (Definition 3.10) and elliptic curves of trace 2 (with $\#E(\mathbb{F}_q) = q - 1$).
These curves have $k \le 6$ and consequently should not be used in the elliptic curve
protocols discussed in this book unless the underlying finite field is large enough so
that the DLP in $\mathbb{F}_{q^k}^*$ is considered intractable. We note that constructive applications
have recently been discovered for supersingular elliptic curves, including the design of
identity-based public-key encryption schemes (see page 199 for references).

To ensure that an elliptic curve $E$ defined over $\mathbb{F}_q$ is immune to the Weil and Tate
pairing attacks, it is sufficient to check that $n$, the order of the base point $P \in E(\mathbb{F}_q)$,
does not divide $q^k - 1$ for all small $k$ for which the DLP in $\mathbb{F}_{q^k}^*$ is considered tractable.
If $n > 2^{160}$ then it suffices to check this condition for all $k \in [1, 20]$.
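
The divisibility check described above amounts to computing successive powers of $q$ modulo $n$. The sketch below is a minimal illustration; the function name and the default bound `k_max = 20` mirror the range mentioned in the text but are otherwise hypothetical, and the sample inputs are toy values chosen only to exercise the function.

```python
def embedding_degree(q, n, k_max=20):
    """Smallest k in [1, k_max] with n dividing q^k - 1, or None if there is none."""
    x = 1
    for k in range(1, k_max + 1):
        x = x * q % n            # x = q^k mod n
        if x == 1:
            return k
    return None

# A curve passes the check when no small embedding degree is found.
print(embedding_degree(229, 239))            # toy parameters: field size 229, order 239
print(embedding_degree(2, 2 ** 127 - 1))     # order of 2 modulo 2^127 - 1 exceeds 20
```

A return value of `None` means that $n$ divides none of $q - 1, q^2 - 1, \ldots, q^{20} - 1$, which is the condition the text asks for; any integer result flags a curve whose ECDLP embeds into a comparatively small extension field.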
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Weil descent 


Suppose that / is a non-supersingular elliptic curve defined over a binary field K = 
Fo», and suppose that #E (F2”) = nh where n is prime and h is small (e.g., h = 2 or 
h =4). In 1998, Frey proposed using Weil descent to reduce the ECDLP in E(F”) 
to the DLP in the jacobian variety of a curve of larger genus defined over a proper 
subfield k = F2! of K. Let d = m/l. In Frey’s method, referred to as the Weil descent 
attack methodology, one first constructs the so-called Weil restriction Wx, of scalars 
of E, which is a d-dimensional abelian variety over k. One then attempts to find a curve 
C defined over k in Wx x such that (i) there are algorithms for solving the DLP in the 
jacobian Jc(k) of C over k that are faster than Pollard’s rho method; and (ii) ECDLP 
instances in E(K) can be efficiently mapped to DLP instances in Jc(k). 

Gaudry, Hess and Smart (GHS) showed how the Weil restriction $W_{K/k}$ can be
intersected with d − 1 hyperplanes to eventually obtain a hyperelliptic curve C of genus
g defined over k from an irreducible component in the intersection. Furthermore, they
gave an efficient algorithm that (in most cases) reduces ECDLP instances in E(K)
to instances of the hyperelliptic curve discrete logarithm problem (HCDLP) in $J_C(k)$.
Now, the Enge-Gaudry index-calculus algorithm for the HCDLP in a genus-g
hyperelliptic curve over $\mathbb{F}_q$ has a subexponential expected running time of $L_{q^g}[\sqrt{2}]$ bit
operations for $g/\log q \to \infty$. Thus, provided that g is not too large, the GHS attack
yields a subexponential-time algorithm for the original ECDLP.

It was subsequently shown that the GHS attack fails for all cryptographically
interesting elliptic curves over $\mathbb{F}_{2^m}$ for all prime $m \in [160, 600]$. Note that such fields have
only one proper subfield, namely $\mathbb{F}_2$. In particular, it was shown that the hyperelliptic
curves C produced by the GHS attack either have genus too small (whence $J_C(\mathbb{F}_2)$ is
too small to yield any non-trivial information about the ECDLP in $E(\mathbb{F}_{2^m})$), or have
genus too large ($g \ge 2^{16} - 1$, whence the HCDLP in $J_C(\mathbb{F}_2)$ is infeasible using known
methods for solving the HCDLP). The GHS attack has also been shown to fail for all
elliptic curves over certain fields $\mathbb{F}_{2^m}$ where $m \in [160, 600]$ is composite; such fields
include $\mathbb{F}_{2^{169}}$, $\mathbb{F}_{2^{209}}$ and $\mathbb{F}_{2^{247}}$.

However, the GHS attack is effective for solving the ECDLP in some elliptic curves
over $\mathbb{F}_{2^m}$ where $m \in [160, 600]$ is composite. For example, the ECDLP in
approximately $2^{94}$ of the $2^{162}$ isomorphism classes of elliptic curves over $\mathbb{F}_{2^{161}}$ can be solved
in about $2^{48}$ steps by using the GHS attack to reduce the problem to an instance of
the HCDLP in a genus-8 hyperelliptic curve over the subfield $\mathbb{F}_{2^{23}}$. Since Pollard’s rho
method takes roughly $2^{80}$ steps for solving the ECDLP in cryptographically interesting
elliptic curves over $\mathbb{F}_{2^{161}}$, the GHS attack is deemed to be successful for the $2^{94}$ elliptic
curves.

Let $\mathbb{F}_{2^m}$, where $m \in [160, 600]$ is composite, be a binary field for which the GHS
attack exhibits some success. Then the proportion of elliptic curves over $\mathbb{F}_{2^m}$ that
succumb to the GHS attack is relatively small. Thus, if one selects an elliptic curve over
$\mathbb{F}_{2^m}$ at random, then there is a very high probability that the elliptic curve will resist
the GHS attack. However, failure of the GHS attack does not imply failure of the Weil
descent attack methodology; there may be other useful curves which lie on the Weil
restriction that were not constructed by the GHS method. Thus, to account for
potential future developments in the Weil descent attack methodology, it seems prudent to
altogether avoid using elliptic curves over $\mathbb{F}_{2^m}$ where m is composite.


4.1.5 Related problems 


While hardness of the ECDLP is necessary for the security of any elliptic curve cryp- 
tographic scheme, it is generally not sufficient. We present some problems related to 
the ECDLP whose hardness is assumed in the security proofs for some elliptic curve 
protocols. All these problems can be presented in the setting of a general cyclic group;
however, we restrict the discussion to elliptic curve groups.


Elliptic curve Diffie-Hellman problem 


Definition 4.9 The (computational) elliptic curve Diffie-Hellman problem (ECDHP)
is: given an elliptic curve E defined over a finite field $\mathbb{F}_q$, a point $P \in E(\mathbb{F}_q)$ of order
n, and points A = aP, B = bP $\in \langle P \rangle$, find the point C = abP.


If the ECDLP in $\langle P \rangle$ can be efficiently solved, then the ECDHP in $\langle P \rangle$ can also be
efficiently solved by first finding a from (P, A) and then computing C = aB. Thus the
ECDHP is no harder than the ECDLP. It is not known whether the ECDHP is as hard
as the ECDLP; that is, no one knows how to efficiently solve the ECDLP given
a (hypothetical) oracle that efficiently solves the ECDHP. However, the equivalence of
the ECDLP and ECDHP has been proven in some special cases where the ECDLP is
believed to be hard, for example when n is prime and all the prime factors of n − 1 are
small. The strongest evidence for the hardness of the ECDHP comes from a result of
Boneh and Lipton who proved (under some reasonable assumptions about the
distribution of smooth integers in a certain interval) that if n is prime and the ECDLP cannot be
solved in $L_n[\frac{1}{2}, c]$ subexponential time (for some constant c), then the ECDHP cannot
be solved in $L_n[\frac{1}{2}, c - 2]$ subexponential time. Further evidence for the hardness of the
ECDHP comes from Shoup’s lower bound of $\sqrt{n}$ for the Diffie-Hellman problem in
generic groups of prime order n.


Elliptic curve decision Diffie-Hellman problem 


The ECDHP is concerned with computing the Diffie-Hellman secret point abP given
(P, aP, bP). For the security of some elliptic curve protocols, it may be necessary
that an adversary does not learn any information about abP. This requirement can
be formalized by insisting that the adversary cannot distinguish abP from a random
element in $\langle P \rangle$.


Definition 4.10 The elliptic curve decision Diffie-Hellman problem (ECDDHP) is:
given an elliptic curve E defined over a finite field $\mathbb{F}_q$, a point $P \in E(\mathbb{F}_q)$ of order
n, and points A = aP, B = bP, and C = cP $\in \langle P \rangle$, determine whether C = abP or,
equivalently, whether $c \equiv ab \pmod{n}$.


If the ECDHP in $\langle P \rangle$ can be efficiently solved, then the ECDDHP in $\langle P \rangle$ can also
be efficiently solved by first finding C' = abP from (P, A, B) and then comparing C'
with C. Thus the ECDDHP is no harder than the ECDHP (and also the ECDLP). The
only hardness result that has been proved for ECDDHP is Shoup’s lower bound of $\sqrt{n}$
for the decision Diffie-Hellman problem in generic groups of prime order n.
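
The chain of reductions (an ECDLP solver yields an ECDHP solver, which in turn yields an ECDDHP solver) can be made concrete on a toy group. The following is a minimal sketch under stated assumptions: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point (5, 1) of order 19 is chosen only for illustration, the ECDLP "oracle" is plain exhaustive search, and all function names are hypothetical.

```python
# Toy parameters (assumed for illustration): E: y^2 = x^3 + 2x + 2 over F_17,
# base point (5, 1) of prime order 19. None stands for the point at infinity.
FIELD_P, COEFF_A = 17, 2
BASE_POINT, ORDER_N = (5, 1), 19


def ec_add(P, Q):
    """Add two affine points on the toy curve."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % FIELD_P == 0:
        return None                                   # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + COEFF_A) * pow((2 * y1) % FIELD_P, -1, FIELD_P)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % FIELD_P, -1, FIELD_P)
    x3 = (lam * lam - x1 - x2) % FIELD_P
    return (x3, (lam * (x1 - x3) - y1) % FIELD_P)


def ec_mul(k, P):
    """Double-and-add scalar multiplication kP."""
    result = None
    while k > 0:
        if k & 1:
            result = ec_add(result, P)
        P = ec_add(P, P)
        k >>= 1
    return result


def solve_ecdlp(P, A):
    """Stand-in ECDLP oracle: exhaustive search for a with A = aP (toy sizes only)."""
    current = None
    for a in range(ORDER_N):
        if current == A:
            return a
        current = ec_add(current, P)
    raise ValueError("A is not in the subgroup generated by P")


def solve_ecdhp(P, A, B):
    """ECDHP via the ECDLP oracle: find a from (P, A), then return aB = abP."""
    return ec_mul(solve_ecdlp(P, A), B)


def decide_ecddh(P, A, B, C):
    """ECDDHP via the ECDHP solver: compare C with the computed abP."""
    return solve_ecdhp(P, A, B) == C
```

For example, with A = 7P and B = 11P the call `decide_ecddh(BASE_POINT, A, B, C)` accepts exactly when C = 77P. The reverse directions are the open questions discussed above and are not illustrated here.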


4.2 Domain parameters 


Domain parameters for an elliptic curve scheme describe an elliptic curve E defined
over a finite field $\mathbb{F}_q$, a base point $P \in E(\mathbb{F}_q)$, and its order n. The parameters should
be chosen so that the ECDLP is resistant to all known attacks. There may also be other
constraints for security or implementation reasons. Typically, domain parameters are
shared by a group of entities; however, in some applications they may be specific to
each user. For the remainder of this section we shall assume that the underlying field is
either a prime field (§2.2), a binary field (§2.3), or an optimal extension field (§2.4).


Definition 4.11 Domain parameters D = (q, FR, S, a, b, P, n, h) consist of:

1. The field order q.

2. An indication FR (field representation) of the representation used for the
elements of $\mathbb{F}_q$.

3. A seed S if the elliptic curve was randomly generated in accordance with
Algorithm 4.17, Algorithm 4.19, or Algorithm 4.22.

4. Two coefficients $a, b \in \mathbb{F}_q$ that define the equation of the elliptic curve E over
$\mathbb{F}_q$ (i.e., $y^2 = x^3 + ax + b$ in the case of a prime field or an OEF, and
$y^2 + xy = x^3 + ax^2 + b$ in the case of a binary field).

5. Two field elements $x_P$ and $y_P$ in $\mathbb{F}_q$ that define a finite point $P = (x_P, y_P) \in
E(\mathbb{F}_q)$ in affine coordinates. P has prime order and is called the base point.

6. The order n of P.

7. The cofactor $h = \#E(\mathbb{F}_q)/n$.


Security constraints In order to avoid the Pohlig-Hellman attack (§4.1.1) and
Pollard’s rho attack (§4.1.2) on the ECDLP, it is necessary that $\#E(\mathbb{F}_q)$ be divisible by
a sufficiently large prime n. At a minimum, one should have $n > 2^{160}$. Having fixed
an underlying field $\mathbb{F}_q$, maximum resistance to the Pohlig-Hellman and Pollard’s rho
attacks is attained by selecting E so that $\#E(\mathbb{F}_q)$ is prime or almost prime, that is,
$\#E(\mathbb{F}_q) = hn$ where n is prime and h is small (e.g., h = 1, 2, 3 or 4).

Some further precautions should be exercised to assure resistance to isomorphism
attacks (§4.1.4). To avoid the attack on prime-field-anomalous curves, one should
verify that $\#E(\mathbb{F}_q) \ne q$. To avoid the Weil and Tate pairing attacks, one should ensure that
n does not divide $q^k - 1$ for all $1 \le k \le C$, where C is large enough so that the DLP
in $\mathbb{F}_{q^C}^*$ is considered intractable (if $n > 2^{160}$ then C = 20 suffices). Finally, to ensure
resistance to the Weil descent attack, one may consider using a binary field $\mathbb{F}_{2^m}$ only if
m is prime.
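
The three precautions above reduce to integer tests on q, $\#E(\mathbb{F}_q)$, n and (for binary fields) m. A minimal sketch follows; the function names and the keyword arguments are hypothetical, and the bound C is passed in as `max_k` rather than fixed.

```python
from math import isqrt


def smallest_embedding_degree(q, n, max_k=20):
    """Return the smallest k in [1, max_k] with n | q^k - 1, or None if there is none."""
    power = 1
    for k in range(1, max_k + 1):
        power = (power * q) % n            # power == q^k mod n
        if power == 1:
            return k
    return None


def is_small_prime(m):
    """Trial division; adequate for extension degrees m, not for cryptographic sizes."""
    return m >= 2 and all(m % f for f in range(2, isqrt(m) + 1))


def passes_isomorphism_checks(q, curve_order, n, max_k=20, binary_degree=None):
    """Apply the anomalous-curve, pairing, and Weil descent precautions."""
    if curve_order == q:                                      # prime-field-anomalous
        return False
    if smallest_embedding_degree(q, n, max_k) is not None:    # Weil/Tate pairing attacks
        return False
    if binary_degree is not None and not is_small_prime(binary_degree):
        return False                                          # composite m: Weil descent
    return True
```

As a small worked case, for q = 17 and n = 19 the powers of 17 modulo 19 first return to 1 at k = 9, so such a toy curve would be rejected with the bound C = 20.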


Selecting elliptic curves verifiably at random A prudent way to guard against at- 
tacks on special classes of curves that may be discovered in the future is to select the 
elliptic curve E at random subject to the condition that $\#E(\mathbb{F}_q)$ is divisible by a large
prime. Since the probability that a random curve succumbs to one of the known special- 
purpose isomorphism attacks is negligible, the known attacks are also prevented. A 
curve can be selected verifiably at random by choosing the coefficients of the defining 
elliptic curve as the outputs of a one-way function such as SHA-1 according to some 
pre-specified procedure. The input seed S to the function then serves as proof (under 
the assumption that SHA-1 cannot be inverted) that the elliptic curve was indeed gen- 
erated at random. This provides some assurance to the user of the elliptic curve that 
it was not intentionally constructed with hidden weaknesses which could thereafter be 
exploited to recover the user’s private keys. 


4.2.1 Domain parameter generation and validation 


Algorithm 4.14 is one way to generate cryptographically secure domain parameters— 
all the security constraints discussed above are satisfied. A set of domain parameters 
can be explicitly validated using Algorithm 4.15. The validation process proves that 
the elliptic curve in question has the claimed order and resists all known attacks on 
the ECDLP, and that the base point has the claimed order. An entity who uses elliptic 
curves generated by untrusted software or parties can use validation to be assured that 
the curves are cryptographically secure. 
Sample sets of domain parameters are provided in §A.2. 


Note 4.12 (restrictions on n and L in Algorithms 4.14 and 4.15) 


(i) Since n is chosen to satisfy $n > 2^L$, the condition $L \ge 160$ in the input of
Algorithm 4.14 ensures that $n > 2^{160}$.


(ii) The condition $L \le \lfloor \log_2 q \rfloor$ ensures that $2^L \le q$ whence an elliptic curve E
over $\mathbb{F}_q$ with order $\#E(\mathbb{F}_q)$ divisible by an L-bit prime should exist (recall that
$\#E(\mathbb{F}_q) \approx q$). In addition, if $q = 2^m$ then L should satisfy $L \le \lfloor \log_2 q \rfloor - 1$
because $\#E(\mathbb{F}_{2^m})$ is even (cf. Theorem 3.18(iii)).


(iii) The condition $n > 4\sqrt{q}$ guarantees that $E(\mathbb{F}_q)$ has a unique subgroup of order
n because $\#E(\mathbb{F}_q) \le (\sqrt{q} + 1)^2$ by Hasse’s Theorem (Theorem 3.7) and so $n^2$
does not divide $\#E(\mathbb{F}_q)$. Furthermore, since $hn = \#E(\mathbb{F}_q)$ must lie in the Hasse
interval, it follows that there is only one possible integer h such that $\#E(\mathbb{F}_q) =
hn$, namely $h = \lfloor (\sqrt{q} + 1)^2 / n \rfloor$.
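
Both conditions in part (iii) can be evaluated exactly with integer arithmetic, avoiding floating-point square roots. The sketch below relies on two elementary identities: $n > 4\sqrt{q}$ is equivalent to $n^2 > 16q$, and $\lfloor (\sqrt{q}+1)^2/n \rfloor = \lfloor (q + 1 + \lfloor \sqrt{4q} \rfloor)/n \rfloor$ for positive integers q and n. The function names are hypothetical.

```python
from math import isqrt


def subgroup_is_unique(q, n):
    """Test n > 4*sqrt(q) exactly, as n^2 > 16q."""
    return n * n > 16 * q


def cofactor_from_hasse(q, n):
    """Return floor((sqrt(q) + 1)^2 / n), the only cofactor h consistent with Hasse's bound."""
    # (sqrt(q) + 1)^2 = q + 1 + sqrt(4q); taking the floor of the square root first
    # does not change the final floor because q + 1 and n are integers.
    return (q + 1 + isqrt(4 * q)) // n
```

For instance, a curve over $\mathbb{F}_{17}$ with a base point of prime order 19 passes the uniqueness test (361 > 272) and yields h = 1.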


Note 4.13 (selecting candidate elliptic curves) In Algorithm 4.14, candidate elliptic
curves E are generated verifiably at random using the procedures specified in §4.2.2.
The orders $\#E(\mathbb{F}_q)$ can be determined using the SEA point counting algorithm for the
prime field or OEF case, or a variant of Satoh’s point counting algorithm for the binary
field case (see §4.2.3). The orders $\#E(\mathbb{F}_q)$ of elliptic curves E over $\mathbb{F}_q$ are roughly
uniformly distributed in the Hasse interval $[q + 1 - 2\sqrt{q},\, q + 1 + 2\sqrt{q}]$ if $\mathbb{F}_q$ is a prime
field or an OEF, and roughly uniformly distributed among the even integers in the Hasse
interval if $\mathbb{F}_q$ is a binary field. Thus, one can use estimates of the expected number of
primes in the Hasse interval to obtain fairly accurate estimates of the expected number
of elliptic curves tried until one having prime or almost-prime order is found. The
testing of candidate curves can be accelerated by deploying an early-abort strategy
which first uses the SEA algorithm to quickly determine $\#E(\mathbb{F}_q)$ modulo small primes
l, rejecting those curves where $\#E(\mathbb{F}_q)$ is divisible by l. Only those elliptic curves
which pass these tests are subjected to a full point counting algorithm.

An alternative to using random curves is to select a subfield curve or a curve using 
the CM method (see §4.2.3). Algorithm 4.14 can be easily modified to accommodate 
these selection methods. 


Algorithm 4.14 Domain parameter generation 


INPUT: A field order q, a field representation FR for $\mathbb{F}_q$, security level L satisfying
    $160 \le L \le \lfloor \log_2 q \rfloor$ and $2^L \ge 4\sqrt{q}$.
OUTPUT: Domain parameters D = (q, FR, S, a, b, P, n, h).
1. Select $a, b \in \mathbb{F}_q$ verifiably at random using Algorithm 4.17, 4.19 or 4.22 if $\mathbb{F}_q$ is
   a prime field, binary field, or OEF, respectively. Let S be the seed returned. Let
   E be $y^2 = x^3 + ax + b$ in the case $\mathbb{F}_q$ is a prime field or an OEF, and
   $y^2 + xy = x^3 + ax^2 + b$ in the case $\mathbb{F}_q$ is a binary field.
2. Compute $N = \#E(\mathbb{F}_q)$ (see §4.2.3).
3. Verify that N is divisible by a large prime n satisfying $n > 2^L$. If not, then go to
   step 1.
4. Verify that n does not divide $q^k - 1$ for $1 \le k \le 20$. If not, then go to step 1.
5. Verify that $n \ne q$. If not, then go to step 1.
6. Set $h \leftarrow N/n$.
7. Select an arbitrary point $P' \in E(\mathbb{F}_q)$ and set P = hP'. Repeat until $P \ne \infty$.
8. Return(q, FR, S, a, b, P, n, h).


Algorithm 4.15 Explicit domain parameter validation


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h).
OUTPUT: Acceptance or rejection of the validity of D.
1. Verify that q is a prime power ($q = p^m$ where p is prime and $m \ge 1$).
2. If p = 2 then verify that m is prime.
3. Verify that FR is a valid field representation.
4. Verify that $a, b, x_P, y_P$ (where $P = (x_P, y_P)$) are elements of $\mathbb{F}_q$ (i.e., verify that
   they are of the proper format for elements of $\mathbb{F}_q$).
5. Verify that a and b define an elliptic curve over $\mathbb{F}_q$ (i.e., $4a^3 + 27b^2 \ne 0$ for fields
   with p > 3, and $b \ne 0$ for binary fields).
6. If the elliptic curve was randomly generated then
   6.1 Verify that S is a bit string of length at least l bits, where l is the bitlength
       of the hash function H.
   6.2 Use Algorithm 4.18 (for prime fields), Algorithm 4.21 (for binary fields)
       or Algorithm 4.23 (for OEFs) to verify that a and b were properly derived
       from S.
7. Verify that $P \ne \infty$.
8. Verify that P satisfies the elliptic curve equation defined by a, b.
9. Verify that n is prime, that $n > 2^{160}$ and that $n > 4\sqrt{q}$.
10. Verify that $nP = \infty$.
11. Compute $h' = \lfloor (\sqrt{q} + 1)^2 / n \rfloor$ and verify that h = h'.
12. Verify that n does not divide $q^k - 1$ for $1 \le k \le 20$.
13. Verify that $n \ne q$.
14. If any verification fails then return("Invalid"); else return("Valid").
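
To show how the steps of Algorithm 4.15 fit together, here is a minimal sketch restricted to prime fields (q = p > 3). It is an illustration under stated assumptions, not a complete validator: the FR check (step 3) and the seed check (step 6) are omitted, primality is tested with a fixed-base Miller-Rabin routine that is only probabilistic for large inputs, and the parameters `min_n` and `max_k` are hypothetical knobs added so that toy curves can be exercised.

```python
from math import isqrt

MR_BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_probable_prime(n):
    """Miller-Rabin with fixed bases (a probabilistic test for large n)."""
    if n < 2:
        return False
    for base in MR_BASES:
        if n % base == 0:
            return n == base
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for base in MR_BASES:
        x = pow(base, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def ec_add(P, Q, a, p):
    """Affine addition on y^2 = x^3 + ax + b over F_p; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p)
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_mul(k, P, a, p):
    """Double-and-add scalar multiplication."""
    result = None
    while k > 0:
        if k & 1:
            result = ec_add(result, P, a, p)
        P = ec_add(P, P, a, p)
        k >>= 1
    return result


def validate_prime_field_params(p, a, b, P, n, h, min_n=2**160, max_k=20):
    """Return True if the parameters pass steps 1, 4, 5 and 7-13 of Algorithm 4.15."""
    if p <= 3 or not is_probable_prime(p):                        # step 1 (m = 1 only)
        return False
    if P is None:                                                 # step 7
        return False
    if not all(isinstance(v, int) and 0 <= v < p for v in (a, b, *P)):   # step 4
        return False
    if (4 * a**3 + 27 * b**2) % p == 0:                           # step 5
        return False
    x, y = P
    if (y * y - (x**3 + a * x + b)) % p != 0:                     # step 8
        return False
    if not is_probable_prime(n) or n <= min_n or n * n <= 16 * p:  # step 9
        return False
    if ec_mul(n, P, a, p) is not None:                            # step 10
        return False
    if h != (p + 1 + isqrt(4 * p)) // n:                          # step 11
        return False
    if any(pow(p, k, n) == 1 for k in range(1, max_k + 1)):       # step 12
        return False
    return n != p                                                 # step 13
```

With the default bounds, a toy curve such as $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with P = (5, 1) and n = 19 is rejected at step 9; relaxing `min_n` and `max_k` lets the remaining checks be exercised on it.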


4.2.2 Generating elliptic curves verifiably at random 


Algorithms 4.17, 4.19 and 4.22 are specifications for generating elliptic curves 
verifiably at random over prime fields, binary fields, and OEFs, respectively. The corre- 
sponding verification procedures are presented as Algorithms 4.18, 4.21 and 4.23. The 
algorithms for prime fields and binary fields are from the ANSI X9.62 standard. 


Note 4.16 (explanation of the parameter r in Algorithms 4.17 and 4.22) Suppose that
$\mathbb{F}_q$ is a finite field of characteristic > 3. If elliptic curves $E_1 : y^2 = x^3 + a_1 x + b_1$
and $E_2 : y^2 = x^3 + a_2 x + b_2$ defined over $\mathbb{F}_q$ are isomorphic over $\mathbb{F}_q$ and satisfy
$b_1 \ne 0$ (so $b_2 \ne 0$), then $a_1^3/b_1^2 = a_2^3/b_2^2$. The singular elliptic curves, that is, the curves
$E : y^2 = x^3 + ax + b$ for which $4a^3 + 27b^2 = 0$ in $\mathbb{F}_q$, are precisely those which
either have a = 0 and b = 0, or $a^3/b^2 = -27/4$. If $r \in \mathbb{F}_q$ with $r \ne 0$ and $r \ne -27/4$,
then there are precisely two isomorphism classes of curves $E : y^2 = x^3 + ax + b$ with
$a^3/b^2 = r$ in $\mathbb{F}_q$. Hence, there are essentially only two choices for (a, b) in step 10
of Algorithms 4.17 and 4.22. The conditions $r \ne 0$ and $r \ne -27/4$ imposed in step 9
of both algorithms ensure the exclusion of singular elliptic curves. Finally, we
mention that this method of generating curves will never produce the elliptic curves with
a = 0, $b \ne 0$, nor the elliptic curves with $a \ne 0$, b = 0. This is not a concern because
such curves constitute a negligible fraction of all elliptic curves, and therefore are
unlikely to ever be generated by any method which selects an elliptic curve uniformly at
random.


Generating random elliptic curves over prime fields 


Algorithm 4.17 Generating a random elliptic curve over a prime field $\mathbb{F}_p$


INPUT: A prime p > 3, and an l-bit hash function H.
OUTPUT: A seed S, and $a, b \in \mathbb{F}_p$ defining an elliptic curve $E : y^2 = x^3 + ax + b$.
1. Set $t \leftarrow \lceil \log_2 p \rceil$, $s \leftarrow \lfloor (t - 1)/l \rfloor$, $v \leftarrow t - sl$.
2. Select an arbitrary bit string S of length $g \ge l$ bits.
3. Compute h = H(S), and let $r_0$ be the bit string of length v bits obtained by taking
   the v rightmost bits of h.
4. Let $R_0$ be the bit string obtained by setting the leftmost bit of $r_0$ to 0.
5. Let z be the integer whose binary representation is S.
6. For i from 1 to s do:
   6.1 Let $s_i$ be the g-bit binary representation of the integer $(z + i) \bmod 2^g$.
   6.2 Compute $R_i = H(s_i)$.
7. Let $R = R_0 \| R_1 \| \cdots \| R_s$.
8. Let r be the integer whose binary representation is R.
9. If r = 0 or if $4r + 27 \equiv 0 \pmod{p}$ then go to step 2.
10. Select arbitrary $a, b \in \mathbb{F}_p$, not both 0, such that $r \cdot b^2 \equiv a^3 \pmod{p}$.
11. Return(S, a, b).
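
Steps 1 and 3 to 9 of Algorithm 4.17 can be sketched as follows. The assumptions are: H is SHA-1 (so l = 160), the seed is given as a byte string (so g is a multiple of 8), and in step 10 the particular choice a = b = r is made, which satisfies $r \cdot b^2 \equiv a^3$ but is only one of the admissible choices. The function names are hypothetical. A verifier recomputes r from S in the same way and tests the congruence.

```python
import hashlib

HASH_BITS = 160  # l, the output length of SHA-1 in bits


def derive_r(seed: bytes, p: int) -> int:
    """Compute r from the seed as in steps 1 and 3-8 (H = SHA-1)."""
    g = 8 * len(seed)
    if g < HASH_BITS:
        raise ValueError("seed must be at least l bits long")
    t = p.bit_length()                   # ceil(log2 p), since an odd prime is not a power of 2
    s = (t - 1) // HASH_BITS
    v = t - s * HASH_BITS
    h = int.from_bytes(hashlib.sha1(seed).digest(), "big")
    r0 = h & ((1 << v) - 1)              # the v rightmost bits of h
    R = r0 & ((1 << (v - 1)) - 1)        # R0: leftmost of those v bits set to 0
    z = int.from_bytes(seed, "big")
    for i in range(1, s + 1):
        s_i = ((z + i) % (1 << g)).to_bytes(len(seed), "big")
        R_i = int.from_bytes(hashlib.sha1(s_i).digest(), "big")
        R = (R << HASH_BITS) | R_i       # R = R0 || R1 || ... || Rs
    return R


def curve_from_seed(seed: bytes, p: int):
    """Return (a, b) for the seed, or None if step 9 rejects it."""
    r = derive_r(seed, p)
    if r == 0 or (4 * r + 27) % p == 0:
        return None
    return r % p, r % p                  # assumption: a = b = r, so r*b^2 = r^3 = a^3


def verify_curve(seed: bytes, p: int, a: int, b: int) -> bool:
    """Recompute r from the seed and test r*b^2 = a^3 (mod p)."""
    r = derive_r(seed, p)
    return (r * b * b - a**3) % p == 0
```

Because the leading bit of R is cleared, r has at most t − 1 bits and is therefore already reduced modulo p.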


Algorithm 4.18 Verifying that an elliptic curve over $\mathbb{F}_p$ was randomly generated


INPUT: Prime p > 3, l-bit hash function H, seed S of bitlength $g \ge l$, and $a, b \in \mathbb{F}_p$
    defining an elliptic curve $E : y^2 = x^3 + ax + b$.
OUTPUT: Acceptance or rejection that E was generated using Algorithm 4.17.
1. Set $t \leftarrow \lceil \log_2 p \rceil$, $s \leftarrow \lfloor (t - 1)/l \rfloor$, $v \leftarrow t - sl$.
2. Compute h = H(S), and let $r_0$ be the bit string of length v bits obtained by taking
   the v rightmost bits of h.
3. Let $R_0$ be the bit string obtained by setting the leftmost bit of $r_0$ to 0.
4. Let z be the integer whose binary representation is S.
5. For i from 1 to s do:
   5.1 Let $s_i$ be the g-bit binary representation of the integer $(z + i) \bmod 2^g$.
   5.2 Compute $R_i = H(s_i)$.
6. Let $R = R_0 \| R_1 \| \cdots \| R_s$.
7. Let r be the integer whose binary representation is R.
8. If $r \cdot b^2 \equiv a^3 \pmod{p}$ then return("Accept"); else return("Reject").


Generating random elliptic curves over binary fields


Algorithm 4.19 Generating a random elliptic curve over a binary field $\mathbb{F}_{2^m}$


INPUT: A positive integer m, and an l-bit hash function H.
OUTPUT: Seed S, and $a, b \in \mathbb{F}_{2^m}$ defining an elliptic curve $E : y^2 + xy = x^3 + ax^2 + b$.
1. Set $s \leftarrow \lfloor (m - 1)/l \rfloor$, $v \leftarrow m - sl$.
2. Select an arbitrary bit string S of length $g \ge l$ bits.
3. Compute h = H(S), and let $b_0$ be the bit string of length v bits obtained by taking
   the v rightmost bits of h.
4. Let z be the integer whose binary representation is S.
5. For i from 1 to s do:
   5.1 Let $s_i$ be the g-bit binary representation of the integer $(z + i) \bmod 2^g$.
   5.2 Compute $b_i = H(s_i)$.
6. Let $b = b_0 \| b_1 \| \cdots \| b_s$.
7. If b = 0 then go to step 2.
8. Select arbitrary $a \in \mathbb{F}_{2^m}$.
9. Return(S, a, b).
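
The derivation of b in steps 1 and 3 to 6 differs from the prime-field case only in that no bit is cleared and the m-bit string is used directly as the field element. A minimal sketch, assuming H is SHA-1, a byte-string seed, and an m-bit integer as a stand-in for the bit string b (how those bits map to a field element depends on FR and is not modelled here); the function names are hypothetical.

```python
import hashlib

HASH_BITS = 160  # l, the output length of SHA-1 in bits


def derive_b(seed: bytes, m: int) -> int:
    """Return the m-bit string b = b0 || b1 || ... || bs as an integer (H = SHA-1)."""
    g = 8 * len(seed)
    if g < HASH_BITS:
        raise ValueError("seed must be at least l bits long")
    s = (m - 1) // HASH_BITS
    v = m - s * HASH_BITS
    h = int.from_bytes(hashlib.sha1(seed).digest(), "big")
    b = h & ((1 << v) - 1)               # b0: the v rightmost bits of h
    z = int.from_bytes(seed, "big")
    for i in range(1, s + 1):
        s_i = ((z + i) % (1 << g)).to_bytes(len(seed), "big")
        b_i = int.from_bytes(hashlib.sha1(s_i).digest(), "big")
        b = (b << HASH_BITS) | b_i
    return b


def b_matches_seed(seed: bytes, m: int, b: int) -> bool:
    """Recompute b from the seed and compare with the claimed value."""
    return derive_b(seed, m) == b
```

A generator would retry with a fresh seed whenever `derive_b` returns 0 (step 7).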




Note 4.20 (selection of a in Algorithm 4.19) By Theorem 3.18(ii) on the isomorphism
classes of elliptic curves over $\mathbb{F}_{2^m}$, it suffices to select a from $\{0, \gamma\}$ where $\gamma \in \mathbb{F}_{2^m}$
satisfies $\mathrm{Tr}(\gamma) = 1$. Recall also from Theorem 3.18(iii) that $\#E(\mathbb{F}_{2^m})$ is always even,
while if a = 0 then $\#E(\mathbb{F}_{2^m})$ is divisible by 4.


Algorithm 4.21 Verifying that an elliptic curve over $\mathbb{F}_{2^m}$ was randomly generated


INPUT: Positive integer m, l-bit hash function H, seed S of bitlength $g \ge l$, and
    $a, b \in \mathbb{F}_{2^m}$ defining an elliptic curve $E : y^2 + xy = x^3 + ax^2 + b$.
OUTPUT: Acceptance or rejection that E was generated using Algorithm 4.19.
1. Set $s \leftarrow \lfloor (m - 1)/l \rfloor$, $v \leftarrow m - sl$.
2. Compute h = H(S), and let $b_0$ be the bit string of length v bits obtained by taking
   the v rightmost bits of h.
3. Let z be the integer whose binary representation is S.
4. For i from 1 to s do:
   4.1 Let $s_i$ be the g-bit binary representation of the integer $(z + i) \bmod 2^g$.
   4.2 Compute $b_i = H(s_i)$.
5. Let $b' = b_0 \| b_1 \| \cdots \| b_s$.
6. If b' = b then return("Accept"); else return("Reject").


Generating random elliptic curves over OEFs


Algorithm 4.22 Generating a random elliptic curve over an OEF $\mathbb{F}_{p^m}$


INPUT: A prime p > 3, reduction polynomial $f(x) \in \mathbb{F}_p[x]$ of degree m, and an l-bit
    hash function H.
OUTPUT: A seed S, and $a, b \in \mathbb{F}_{p^m}$ defining an elliptic curve $E : y^2 = x^3 + ax + b$.
1. Set $W \leftarrow \lceil \log_2 p \rceil$, $t \leftarrow W \cdot m$, $s \leftarrow \lfloor (t - 1)/l \rfloor$, $v \leftarrow t - sl$.
2. Select an arbitrary bit string S of length $g \ge l$ bits.
3. Compute h = H(S), and let $T_0$ be the bit string of length v bits obtained by
   taking the v rightmost bits of h.
4. Let z be the integer whose binary representation is S.
5. For i from 1 to s do:
   5.1 Let $s_i$ be the g-bit binary representation of the integer $(z + i) \bmod 2^g$.
   5.2 Compute $T_i = H(s_i)$.
6. Write $T_0 \| T_1 \| \cdots \| T_s = R_{m-1} \| \cdots \| R_1 \| R_0$ where each $R_i$ is a W-bit string.
7. For each i, $0 \le i \le m - 1$, let $r_i = \bar{R}_i \bmod p$, where $\bar{R}_i$ denotes the integer whose
   binary representation is $R_i$.
8. Let r be the element $(r_{m-1}, \ldots, r_1, r_0)$ in the OEF defined by p and f(x).
9. If r = 0 or if 4r + 27 = 0 in $\mathbb{F}_{p^m}$ then go to step 2.
10. Select arbitrary $a, b \in \mathbb{F}_{p^m}$, not both 0, such that $r \cdot b^2 = a^3$ in $\mathbb{F}_{p^m}$.
11. Return(S, a, b).


Algorithm 4.23 Verifying that an elliptic curve over $\mathbb{F}_{p^m}$ was randomly generated


INPUT: Prime p > 3, reduction polynomial $f(x) \in \mathbb{F}_p[x]$ of degree m, l-bit hash
    function H, seed S of bitlength $g \ge l$, and $a, b \in \mathbb{F}_{p^m}$ defining an elliptic curve
    $E : y^2 = x^3 + ax + b$.
OUTPUT: Acceptance or rejection that E was generated using Algorithm 4.22.
1. Set $W \leftarrow \lceil \log_2 p \rceil$, $t \leftarrow W \cdot m$, $s \leftarrow \lfloor (t - 1)/l \rfloor$, $v \leftarrow t - sl$.
2. Compute h = H(S), and let $T_0$ be the bit string of length v bits obtained by
   taking the v rightmost bits of h.
3. Let z be the integer whose binary representation is S.
4. For i from 1 to s do:
   4.1 Let $s_i$ be the g-bit binary representation of the integer $(z + i) \bmod 2^g$.
   4.2 Compute $T_i = H(s_i)$.
5. Write $T_0 \| T_1 \| \cdots \| T_s = R_{m-1} \| \cdots \| R_1 \| R_0$ where each $R_i$ is a W-bit string.
6. For each i, $0 \le i \le m - 1$, let $r_i = \bar{R}_i \bmod p$, where $\bar{R}_i$ denotes the integer whose
   binary representation is $R_i$.
7. Let r be the element $(r_{m-1}, \ldots, r_1, r_0)$ in the OEF defined by p and f(x).
8. If $r \cdot b^2 = a^3$ in $\mathbb{F}_{p^m}$ then return("Accept"); else return("Reject").


4.2.3 Determining the number of points on an elliptic curve


As discussed in the introduction to §4.2, the order $\#E(\mathbb{F}_q)$ of an elliptic curve E used
in a cryptographic protocol should satisfy some constraints imposed by security
considerations. Thus, determining the number of points on an elliptic curve is an important
ingredient of domain parameter generation. A naive algorithm for point counting is to
find, for each $x \in \mathbb{F}_q$, the number of solutions $y \in \mathbb{F}_q$ to the Weierstrass equation for
E. This method is clearly infeasible for field sizes of cryptographic interest. In
practice, one of the following three techniques is employed for selecting an elliptic curve
of known order.
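
For a prime field with p > 3, the naive method just described can be written in a few lines. This is a minimal sketch: the number of solutions y for a given x is read off from Euler's criterion rather than by trying every y, and the function name is hypothetical. Its running time is linear in p, which is why it is only usable for very small fields.

```python
def count_points_naive(a, b, p):
    """Return #E(F_p) for E: y^2 = x^3 + ax + b over a prime field, p > 3."""
    total = 1                                    # the point at infinity
    for x in range(p):
        rhs = (x**3 + a * x + b) % p
        if rhs == 0:
            total += 1                           # one solution, y = 0
        elif pow(rhs, (p - 1) // 2, p) == 1:     # Euler's criterion: rhs is a nonzero square
            total += 2                           # two solutions, y and -y
    return total
```

For example, the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ has 19 points, and $y^2 = x^3 + x + 1$ over $\mathbb{F}_{23}$ has 28.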


Subfield curves Let $q = p^{ld}$, where d > 1. One selects an elliptic curve E defined
over $\mathbb{F}_{p^l}$, counts the number of points in $E(\mathbb{F}_{p^l})$ using a naive method, and then easily
determines $\#E(\mathbb{F}_q)$ using Theorem 3.11. The group used for the cryptographic
application is $E(\mathbb{F}_q)$. Since the elliptic curve E is defined over a proper subfield $\mathbb{F}_{p^l}$ of $\mathbb{F}_q$, it is
called a subfield curve. For example, Koblitz curves studied in §3.4 are subfield curves
with p = 2 and l = 1. Since $\#E(\mathbb{F}_{p^{lc}})$ divides $\#E(\mathbb{F}_q)$ for all divisors c of d and an
elliptic curve of prime or almost-prime order is desirable, l should be small (preferably
l = 1) and d should be prime.
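
The lifting step can be sketched as follows, assuming Theorem 3.11 is the usual recurrence for the order over an extension field: if $\#E(\mathbb{F}_q) = q + 1 - t$, then $\#E(\mathbb{F}_{q^d}) = q^d + 1 - V_d$ with $V_0 = 2$, $V_1 = t$ and $V_k = V_1 V_{k-1} - q V_{k-2}$. Here q denotes the order of the small field over which E is defined, and the function name is hypothetical.

```python
def order_over_extension(q, order_over_fq, d):
    """Return #E(F_{q^d}) given #E(F_q), for a curve E defined over F_q."""
    t = q + 1 - order_over_fq            # trace of Frobenius over F_q
    v_prev, v_cur = 2, t                 # V_0 and V_1
    for _ in range(d - 1):
        v_prev, v_cur = v_cur, t * v_cur - q * v_prev
    return q**d + 1 - v_cur
```

For instance, a curve with 19 points over $\mathbb{F}_{17}$ has $289 + 1 + 33 = 323 = 17 \cdot 19$ points over $\mathbb{F}_{17^2}$, which illustrates the divisibility property noted above.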


The complex-multiplication (CM) method In this method, one first selects an order
N that meets the required security constraints, and then constructs an elliptic curve with
that order. For elliptic curves over prime fields, the CM method is also called the
Atkin-Morain method; for binary fields it is called the Lay-Zimmer method. The CM method
is very efficient provided that the finite field order q and the elliptic curve order
N = q + 1 − t are selected so that the complex multiplication field $\mathbb{Q}(\sqrt{t^2 - 4q})$ has small
class number. Cryptographically suitable curves over 160-bit fields can be generated in
one minute on a workstation. In particular, the CM method is much faster than the best
algorithms known for counting the points on randomly selected elliptic curves over
prime fields and OEFs. For elliptic curves over binary fields, the CM method has been
superseded by faster point counting algorithms (see below).

Since the ECDLP is not known to be any easier for elliptic curves having small class 
number, elliptic curves generated using the CM method appear to offer the same level 
of security as those generated randomly. 


Point counting In 1985, Schoof presented the first polynomial-time algorithm
for computing $\#E(\mathbb{F}_q)$ for an arbitrary elliptic curve E. The algorithm computes
$\#E(\mathbb{F}_q) \bmod l$ for small prime numbers l, and then determines $\#E(\mathbb{F}_q)$ using the
Chinese Remainder Theorem. It is inefficient in practice for values of q of practical interest,
but was subsequently improved by several people including Atkin and Elkies resulting
in the so-called Schoof-Elkies-Atkin (SEA) algorithm. The SEA algorithm, which is the
best algorithm known for counting the points on arbitrary elliptic curves over prime
fields or OEFs, takes a few minutes for values of q of practical interest. Since it can
very quickly determine the number of points modulo small primes l, it can be used in
an early-abort strategy to quickly eliminate candidate curves whose orders are divisible
by a small prime number.

In 1999, Satoh proposed a fundamentally new method for counting the number of 
points over finite fields of small characteristic. Variants of Satoh’s method, including 
the Satoh-Skjernaa-Taguchi (SST) and the Arithmetic Geometric Mean (AGM) algo- 
rithms, are extremely fast for the binary field case and can find cryptographically 
suitable elliptic curves over $\mathbb{F}_{2^{163}}$ in just a few seconds on a workstation.


4.3 Key pairs 


An elliptic curve key pair is associated with a particular set of domain parameters 
D = (q, FR, S, a, b, P, n, h). The public key is a randomly selected point Q in the
group $\langle P \rangle$ generated by P. The corresponding private key is $d = \log_P Q$. The entity
A generating the key pair must have the assurance that the domain parameters are 
valid (see §4.2). The association between domain parameters and a public key must 
be verifiable by all entities who may subsequently use A’s public key. In practice, 
this association can be achieved by cryptographic means (e.g., a certification authority 
generates a certificate attesting to this association) or by context (e.g., all entities use 
the same domain parameters). 


Algorithm 4.24 Key pair generation 


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h).
OUTPUT: Public key Q, private key d.

1. Select $d \in_R [1, n - 1]$.

2. Compute Q = dP.

3. Return(Q, d).


Observe that the problem of computing a private key d from the public key Q is pre- 
cisely the elliptic curve discrete logarithm problem. Hence it is crucial that the domain 
parameters D be selected so that the ECDLP is intractable. Furthermore, it is important 
that the numbers d generated be “random” in the sense that the probability of any par- 
ticular value being selected must be sufficiently small to preclude an adversary from 
gaining advantage through optimizing a search strategy based on such probability. 


Public key validation 


The purpose of public key validation is to verify that a public key possesses certain 
arithmetic properties. Successful execution demonstrates that an associated private key 
logically exists, although it does not demonstrate that someone has actually computed 
the private key nor that the claimed owner actually possesses it. Public key validation is
especially important in Diffie-Hellman-based key establishment protocols where an
entity A derives a shared secret k by combining her private key with a public key received
from another entity B, and subsequently uses k in some symmetric-key protocol (e.g., 
encryption or message authentication). A dishonest B might select an invalid public 
key in such a way that the use of k reveals information about A’s private key. 


Algorithm 4.25 Public key validation 


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), public key Q.
OUTPUT: Acceptance or rejection of the validity of Q.
1. Verify that $Q \ne \infty$.
2. Verify that $x_Q$ and $y_Q$ are properly represented elements of $\mathbb{F}_q$ (e.g., integers in
   the interval [0, q − 1] if $\mathbb{F}_q$ is a prime field, and bit strings of length m bits if $\mathbb{F}_q$
   is a binary field of order $2^m$).
3. Verify that Q satisfies the elliptic curve equation defined by a and b.
4. Verify that $nQ = \infty$.
5. If any verification fails then return("Invalid"); else return("Valid").


There may be much faster methods for verifying that $nQ = \infty$ than performing an
expensive point multiplication nQ. For example, if h = 1 (which is usually the case for
elliptic curves over prime fields that are used in practice), then the checks in steps 1, 2
and 3 of Algorithm 4.25 imply that $nQ = \infty$. In some protocols the check that $nQ = \infty$
may be omitted and either embedded in the protocol computations or replaced by the
check that $hQ \ne \infty$. The latter check guarantees that Q is not in a small subgroup of
$E(\mathbb{F}_q)$ of order dividing h.
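
Key pair generation (Algorithm 4.24) and public key validation (Algorithm 4.25) can be sketched together on a toy prime-field curve. The assumptions are: the domain parameters are the illustrative values $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, P = (5, 1), n = 19, h = 1; Python's `secrets` module stands in for the random source; and the function names and the `check_order` flag are hypothetical.

```python
import secrets

# Toy domain parameters (assumed for illustration only).
FIELD_P, COEFF_A, COEFF_B = 17, 2, 2
BASE_POINT, ORDER_N = (5, 1), 19


def ec_add(P, Q):
    """Affine addition on the toy curve; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % FIELD_P == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + COEFF_A) * pow((2 * y1) % FIELD_P, -1, FIELD_P)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % FIELD_P, -1, FIELD_P)
    x3 = (lam * lam - x1 - x2) % FIELD_P
    return (x3, (lam * (x1 - x3) - y1) % FIELD_P)


def ec_mul(k, P):
    """Double-and-add scalar multiplication."""
    result = None
    while k > 0:
        if k & 1:
            result = ec_add(result, P)
        P = ec_add(P, P)
        k >>= 1
    return result


def generate_key_pair():
    """Algorithm 4.24: pick d uniformly from [1, n-1] and return (Q, d) with Q = dP."""
    d = secrets.randbelow(ORDER_N - 1) + 1
    return ec_mul(d, BASE_POINT), d


def validate_public_key(Q, check_order=True):
    """Algorithm 4.25; with check_order=False the nQ = infinity check is skipped."""
    if Q is None:                                                  # step 1
        return False
    x, y = Q
    if not all(isinstance(v, int) and 0 <= v < FIELD_P for v in (x, y)):   # step 2
        return False
    if (y * y - (x**3 + COEFF_A * x + COEFF_B)) % FIELD_P != 0:    # step 3
        return False
    if check_order and ec_mul(ORDER_N, Q) is not None:             # step 4
        return False
    return True
```

Since h = 1 for these toy parameters, every point accepted with `check_order=False` is also accepted by the full check, in line with the remark above.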


Algorithm 4.26 Embedded public key validation 


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), public key Q.
OUTPUT: Acceptance or rejection of the (partial) validity of Q.
1. Verify that $Q \ne \infty$.
2. Verify that $x_Q$ and $y_Q$ are properly represented elements of $\mathbb{F}_q$ (e.g., integers in
   the interval [0, q − 1] if $\mathbb{F}_q$ is a prime field, and bit strings of length m bits if $\mathbb{F}_q$
   is a binary field of order $2^m$).
3. Verify that Q lies on the elliptic curve defined by a and b.
4. If any verification fails then return("Invalid"); else return("Valid").


Small subgroup attacks 


We illustrate the importance of the checks in public key validation by describing a 
small subgroup attack on a cryptographic protocol that is effective if some checks are 
not performed. Suppose that an entity A’s key pair (Q,d) is associated with domain 
parameters D = (q, FR, S, a, b, P, n, h). In the one-pass elliptic curve Diffie-Hellman
(ECDH) protocol, a second entity B who has authentic copies of D and Q selects
$r \in_R [1, n - 1]$ and sends R = rP to A. A then computes the point K = dR, while
B computes the same point K =rQ. Both A and B derive a shared secret key k = 
KDF(K), where KDF is some key derivation function. Note that this key establishment 
protocol only provides unilateral authentication (of A to B), which may be desirable 
in some applications such as the widely deployed SSL protocol where the server is 
authenticated to the client but not conversely. We suppose that A and B subsequently 
use the key k to authenticate messages for each other using a message authentication 
code algorithm MAC. 

Suppose now that A omits the check that $nQ = \infty$ in public key validation (step 4
in Algorithm 4.25). Let l be a prime divisor of the cofactor h. In the small subgroup
attack, B sends to A a point R of order l (instead of a point in the group $\langle P \rangle$ of order n).
A computes K = dR and k = KDF(K). Since R has order l, K also has order l (unless
$d \equiv 0 \pmod{l}$ in which case $K = \infty$). Thus $K = d_l R$ where $d_l = d \bmod l$. Now, when
A sends to B a message m and its authentication tag $t = \mathrm{MAC}_k(m)$, B can repeatedly
select $l' \in [0, l - 1]$ until $t = \mathrm{MAC}_{k'}(m)$ where k' = KDF(K') and K' = l'R; then
$d_l = l'$ with high probability. The expected number of trials before B succeeds is l/2.
B can repeat the attack with different points R of pairwise relatively prime orders
$l_1, l_2, \ldots, l_s$, and combine the results using the Chinese Remainder Theorem to obtain
$d \bmod l_1 l_2 \cdots l_s$. If h is relatively large, then B can obtain significant information about
A’s private key d, and can perhaps then deduce all of d by exhaustive search.
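
The attack can be simulated on a toy curve with cofactor greater than 1. The sketch below assumes the curve $y^2 = x^3 + x + 1$ over $\mathbb{F}_{23}$, which has 28 = 4 · 7 points, so that n = 7, h = 4, and R = (4, 0) is a point of order l = 2. SHA-256 and HMAC stand in for the unspecified KDF and MAC, and all function names are hypothetical.

```python
import hashlib
import hmac

FIELD_P, COEFF_A = 23, 1          # toy curve y^2 = x^3 + x + 1 over F_23 (28 points)
SUBGROUP_ORDER = 7                # n; the cofactor is h = 4
SMALL_ORDER_POINT = (4, 0)        # a point of order l = 2


def ec_add(P, Q):
    """Affine addition; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % FIELD_P == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + COEFF_A) * pow((2 * y1) % FIELD_P, -1, FIELD_P)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % FIELD_P, -1, FIELD_P)
    x3 = (lam * lam - x1 - x2) % FIELD_P
    return (x3, (lam * (x1 - x3) - y1) % FIELD_P)


def ec_mul(k, P):
    """Double-and-add scalar multiplication."""
    result = None
    while k > 0:
        if k & 1:
            result = ec_add(result, P)
        P = ec_add(P, P)
        k >>= 1
    return result


def kdf(K):
    """Stand-in key derivation: hash an encoding of the point (infinity included)."""
    return hashlib.sha256(b"infinity" if K is None else bytes(K)).digest()


def mac(key, message):
    return hmac.new(key, message, hashlib.sha256).digest()


def victim_tag(d, R, message):
    """A's side: computes K = dR without checking that nR = infinity."""
    return mac(kdf(ec_mul(d, R)), message)


def recover_residue(R, l, message, tag):
    """B's side: try l' in [0, l-1] until the tag matches; returns d mod l."""
    for guess in range(l):
        if hmac.compare_digest(mac(kdf(ec_mul(guess, R)), message), tag):
            return guess
    return None
```

Had A checked that `ec_mul(SUBGROUP_ORDER, R)` is the point at infinity, the point (4, 0) would have been rejected, since 7R = R for a point of order 2.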

In practice, h is usually small (e.g., h = 1,2 or 4) in which case the small subgroup 
attack described above can only determine a very small number of bits of d. We next 
describe an attack that extends the small subgroup attack to elliptic curves different 
from the one specified in the domain parameters. 


Invalid-curve attacks 


The main observation in invalid-curve attacks is that the usual formulae for adding
points on an elliptic curve E defined over F_q do not involve the coefficient b (see
§3.1.2). Thus, if E’ is any elliptic curve defined over F_q whose reduced Weierstrass
equation differs from E’s only in the coefficient b, then the addition laws for E’ and E
are the same. Such an elliptic curve E’ is called an invalid curve relative to E.
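
The observation can be made concrete with a small Python sketch. It assumes a toy curve y^2 = x^3 + 2x + 2 over F_17 chosen purely for illustration; the function names double and on_curve are hypothetical. The doubling routine never reads b, so it works on a point of an invalid curve, while an explicit curve-membership test exposes that point.

```python
def double(P, a, p):
    """Affine doubling for y^2 = x^3 + ax + b over F_p; b is never used."""
    x, y = P
    lam = (3 * x * x + a) * pow(2 * y, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    return x3, (lam * (x - x3) - y) % p

def on_curve(P, a, b, p):
    """Check that P satisfies y^2 = x^3 + ax + b over F_p."""
    x, y = P
    return (y * y - (x ** 3 + a * x + b)) % p == 0

p, a, b = 17, 2, 2                  # toy parameters, illustration only
good = (5, 1)                       # a point on the legitimate curve E
bad = (5, 2)                        # a point that is not on E

# The coefficient b' of the invalid curve E' that contains `bad`.
b_prime = (bad[1] ** 2 - bad[0] ** 3 - a * bad[0]) % p

print(on_curve(good, a, b, p), on_curve(bad, a, b, p))
print(on_curve(double(bad, a, p), a, b_prime, p))   # the result stays on E'
```

A receiver that runs on_curve against its own b before using a received point rejects bad immediately.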

Suppose now that A does not perform public key validation on points it receives
in the one-pass ECDH protocol. The attacker B selects an invalid curve E’ such that
E’(F_q) contains a point R of small order l, and sends R to A. A computes K = dR and
k = KDF(K). As with the small subgroup attack, when A sends B a message m and its
tag t = MAC_k(m), B can determine d_l = d mod l. By repeating the attack with points
R (on perhaps different invalid curves) of relatively prime orders, B can eventually
recover d.

The simplest way to prevent the invalid-curve attacks is to check that a received point 
does indeed lie on the legitimate elliptic curve.


4.4 Signature schemes


Signature schemes are the digital counterparts to handwritten signatures. They can be
used to provide data origin authentication, data integrity, and non-repudiation. Signa- 
ture schemes are commonly used by trusted certification authorities to sign certificates 
that bind together an entity and its public key. 


Definition 4.27 A signature scheme consists of four algorithms: 


1. A domain parameter generation algorithm that generates a set D of domain 
parameters. 


2. A key generation algorithm that takes as input a set D of domain parameters and 
generates key pairs (Q, d). 


3. A signature generation algorithm that takes as input a set of domain parameters
D, a private key d, and a message m, and produces a signature Σ.


4. A signature verification algorithm that takes as input the domain parameters D,
a public key Q, a message m, and a purported signature Σ, and accepts or rejects
the signature.


We assume that the domain parameters D are valid (see §4.2) and that the public key
Q is valid and associated with D (see §4.3). The signature verification algorithm al-
ways accepts input (D, Q, m, Σ) if Σ was indeed generated by the signature generation
algorithm with input (D, d, m).


The following notion of security of a signature scheme is due to Goldwasser, Micali 
and Rivest (GMR). 


Definition 4.28 A signature scheme is said to be secure (or GMR-secure) if it is ex- 
istentially unforgeable by a computationally bounded adversary who can mount an 
adaptive chosen-message attack. In other words, an adversary who can obtain signa- 
tures of any messages of its choosing from the legitimate signer is unable to produce a 
valid signature of any new message (for which it has not already requested and obtained 
a signature). 


This security definition is a very strong one—the adversary is afforded tremendous 
powers (access to a signing oracle) while its goals are very weak (obtain the signature of 
any message not previously presented to the signing oracle). It can be argued that this 
notion is too strong for some applications—perhaps adversaries are unable to obtain 
signatures of messages of their choice, or perhaps the messages whose signatures they 
are able to forge are meaningless (and therefore harmless) within the context of the 
application. However, it is impossible for the designer of a signature scheme intended 
for widespread use to predict the precise abilities of adversaries in all environments 
in which the signature scheme will be deployed. Furthermore, it is impossible for the 
designer to formulate general criteria to determine which messages will be considered
“meaningful.” Therefore, it is prudent to design signature schemes that are secure under
the strongest possible notion of security; GMR-security has gained acceptance as the
“right” one.

Two standardized signature schemes are presented, ECDSA in §4.4.1, and EC- 
KCDSA in §4.4.2. 


4.4.1 ECDSA 


The Elliptic Curve Digital Signature Algorithm (ECDSA) is the elliptic curve analogue 
of the Digital Signature Algorithm (DSA). It is the most widely standardized elliptic 
curve-based signature scheme, appearing in the ANSI X9.62, FIPS 186-2, IEEE 1363- 
2000 and ISO/IEC 15946-2 standards as well as several draft standards. 

In the following, H denotes a cryptographic hash function whose outputs have 
bitlength no more than that of n (if this condition is not satisfied, then the outputs 
of H can be truncated). 


Algorithm 4.29 ECDSA signature generation 


INPUT: Domain parameters D = (q, FR, S,a,b, P,n,h), private key d, message m. 
OUTPUT: Signature (r,s). 
1. Select k ∈_R [1, n-1].
2. Compute kP = (x_1, y_1) and convert x_1 to an integer x̄_1.
3. Compute r = x̄_1 mod n. If r = 0 then go to step 1.
4. Compute e = H(m).
5. Compute s = k^{-1}(e + dr) mod n. If s = 0 then go to step 1.
6. Return(r, s).


Algorithm 4.30 ECDSA signature verification


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), public key Q, message m,
signature (r, s).
OUTPUT: Acceptance or rejection of the signature.
1. Verify that r and s are integers in the interval [1, n-1]. If any verification fails
   then return(“Reject the signature”).
2. Compute e = H(m).
3. Compute w = s^{-1} mod n.
4. Compute u_1 = ew mod n and u_2 = rw mod n.
5. Compute X = u_1 P + u_2 Q.
6. If X = ∞ then return(“Reject the signature”);
7. Convert the x-coordinate x_1 of X to an integer x̄_1; compute v = x̄_1 mod n.
8. If v = r then return(“Accept the signature”);
   Else return(“Reject the signature”).


Proof that signature verification works  If a signature (r, s) on a message m was
indeed generated by the legitimate signer, then s ≡ k^{-1}(e + dr) (mod n). Rearranging
gives

$$k \equiv s^{-1}(e + dr) \equiv s^{-1}e + s^{-1}rd \equiv we + wrd \equiv u_1 + u_2 d \pmod{n}.$$

Thus X = u_1 P + u_2 Q = (u_1 + u_2 d)P = kP, and so v = r as required.
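
Algorithms 4.29 and 4.30 can be exercised end to end with a short Python sketch. It is only an illustration under stated assumptions: the curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of prime order 19 is a toy choice that is far too small for real use, SHA-256 truncated to the bitlength of n stands in for H, and the helper names add, mul, sign and verify are hypothetical.

```python
import hashlib
import secrets

# Toy parameters, illustration only: cofactor 1, base point P of order n.
p, a = 17, 2
P, n = (5, 1), 19

def add(A, B):
    """Affine addition on the toy curve; None is the point at infinity."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def mul(k, A):
    """Double-and-add scalar multiplication."""
    result = None
    while k > 0:
        if k & 1:
            result = add(result, A)
        A, k = add(A, A), k >> 1
    return result

def H(m):
    """SHA-256 truncated to the bitlength of n (an assumed choice of H)."""
    digest = int.from_bytes(hashlib.sha256(m).digest(), "big")
    return digest >> (256 - n.bit_length())

def sign(d, m):
    while True:
        k = secrets.randbelow(n - 1) + 1          # step 1
        r = mul(k, P)[0] % n                      # steps 2 and 3
        if r == 0:
            continue
        s = pow(k, -1, n) * (H(m) + d * r) % n    # steps 4 and 5
        if s != 0:
            return r, s

def verify(Q, m, sig):
    r, s = sig
    if not (1 <= r < n and 1 <= s < n):           # step 1
        return False
    w = pow(s, -1, n)
    X = add(mul(H(m) * w % n, P), mul(r * w % n, Q))
    return X is not None and X[0] % n == r

d = 7                                             # hypothetical private key
Q = mul(d, P)
signature = sign(d, b"hello")
print(verify(Q, b"hello", signature))
```

With a 5-bit group order, collisions and accidental acceptances are common, so the sketch demonstrates only the arithmetic, not any security property.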





Security notes 


Note 4.31 (security proofs for ECDSA) In order for ECDSA to be GMR-secure, it is 
necessary that the ECDLP in ⟨P⟩ be intractable, and that the hash function H be cryp-
tographically secure (preimage resistant and collision resistant). It has not been proven
that these conditions are also sufficient for GMR-security. ECDSA has, however, been
proven GMR-secure in the generic group model (where the group ⟨P⟩ is replaced by
a generic group) and under reasonable and concrete assumptions about H. While a
security proof in the generic group model does not imply security in the real world 
where a specific group such as an elliptic curve group is used, it arguably inspires some 
confidence in the security of ECDSA. 


Note 4.32 (rationale for security requirements on the hash function) If H is not pre-
image resistant, then an adversary E may be able to forge signatures as follows. E
selects an arbitrary integer l and computes r as the x-coordinate of Q + lP reduced
modulo n. E sets s = r and computes e = rl mod n. If E can find a message m such
that e = H(m), then (r, s) is a valid signature for m.

If H is not collision resistant, then E can forge signatures as follows. She first finds 
two different messages m and m’ such that H(m) = H(m’). She then asks A to sign m; 
the resulting signature is also valid for m’. 


Note 4.33 (rationale for the checks on r and s in signature verification) Step 1 of the
ECDSA signature verification procedure checks that r and s are integers in the interval
[1, n-1]. These checks can be performed very efficiently, and are prudent measures
in light of known attacks on related ElGamal signature schemes which do not perform
these checks. The following is a plausible attack on ECDSA if the check r ≠ 0 (and,
more generally, r ≢ 0 (mod n)) is not performed. Suppose that A is using the elliptic
curve y^2 = x^3 + ax + b over a prime field F_p, where b is a quadratic residue modulo
p, and suppose that A uses a base point P = (0, √b) of prime order n. (It is plausible
that all entities may select a base point with zero x-coordinate in order to minimize the 
size of domain parameters.) An adversary can now forge A’s signature on any message 
m of its choice by computing e = H(m). It can readily be checked that (r = 0, s = e) 
is a valid signature for m.


Note 4.34 (security requirements for per-message secrets) The per-message secrets k
in ECDSA signature generation have the same security requirements as the private key
d. If an adversary E learns a single per-message secret k that A used to generate a
signature (r, s) on some message m, then E can recover A’s private key since

$$d = r^{-1}(ks - e) \bmod n \tag{4.12}$$


where e = H(m) (see step 5 of ECDSA signature generation). Furthermore, Howgrave- 
Graham and Smart have shown that if an adversary somehow learns a few (e.g., five) 
consecutive bits of per-message secrets corresponding to several (e.g., 100) signed 
messages, then the adversary can easily compute the private key. These observations 
demonstrate that per-message secrets must be securely generated, securely stored, and 
securely destroyed after they have been used. 


Note 4.35 (repeated use of per-message secrets) The per-message secrets k should be 
generated randomly. In particular, this ensures that per-message secrets never repeat, 
which is important because otherwise the private key d can be recovered. To see this, 
suppose that the same per-message secret k was used to generate ECDSA signatures
(r, s_1) and (r, s_2) on two messages m_1 and m_2. Then s_1 ≡ k^{-1}(e_1 + dr) (mod n) and
s_2 ≡ k^{-1}(e_2 + dr) (mod n), where e_1 = H(m_1) and e_2 = H(m_2). Then ks_1 ≡ e_1 + dr
(mod n) and ks_2 ≡ e_2 + dr (mod n). Subtraction gives k(s_1 - s_2) ≡ e_1 - e_2 (mod n).
If s_1 ≢ s_2 (mod n), which occurs with overwhelming probability, then

$$k \equiv (s_1 - s_2)^{-1}(e_1 - e_2) \pmod{n}.$$

Thus an adversary can determine k, and then use (4.12) to recover d.
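
The recovery is pure modular arithmetic, so it can be checked with a few lines of Python. In this sketch all concrete values are arbitrary stand-ins chosen for illustration: n is a prime playing the role of the group order, r stands in for the x-coordinate of kP reduced modulo n, and e1, e2 stand in for the two message hashes.

```python
# Arbitrary stand-in values, illustration only.
n = 2**127 - 1                         # a prime playing the role of the group order
d = 0x1234567890ABCDEF                 # private key the adversary wants
k = 0xCAFEBABE                         # per-message secret reused for both signatures
r = 0xDEADBEEF                         # stand-in for x(kP) mod n, identical in both
e1, e2 = 1111, 2222                    # stand-ins for H(m_1) and H(m_2)

def ecdsa_s(e):
    """Step 5 of signature generation: s = k^{-1}(e + dr) mod n."""
    return pow(k, -1, n) * (e + d * r) % n

s1, s2 = ecdsa_s(e1), ecdsa_s(e2)

# The adversary sees only (r, s1), (r, s2), e1 and e2.
k_recovered = pow(s1 - s2, -1, n) * (e1 - e2) % n
d_recovered = pow(r, -1, n) * (k_recovered * s1 - e1) % n    # equation (4.12)

print(k_recovered == k, d_recovered == d)
```

Both comparisons hold because the subtraction cancels the unknown term dr, exactly as in the derivation above.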


4.4.2 EC-KCDSA 


EC-KCDSA is the elliptic curve analogue of the Korean Certificate-based Dig- 
ital Signature Algorithm (KCDSA). The description presented here is from the 
ISO/IEC 15946-2 standard. 

In the following, H denotes a cryptographic hash function whose outputs are bit
strings of length l_H. The bitlength of the domain parameter n should be at least l_H.
hcert is the hash value of the signer’s certification data that should include the signer’s
identifier, domain parameters, and public key. The signer’s private key is an integer
d ∈_R [1, n-1], while her public key is Q = d^{-1}P (instead of dP which is the case with
all other protocols presented in this book). This allows for the design of signature
generation and verification procedures that do not require performing a modular in-
version. In contrast, ECDSA signature generation and verification respectively require
the computation of k^{-1} mod n and s^{-1} mod n.


Algorithm 4.36 EC-KCDSA signature generation


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), private key d, hashed certifi-
cation data hcert, message m.
OUTPUT: Signature (r, s).
1. Select k ∈_R [1, n-1].
2. Compute kP = (x_1, y_1).
3. Compute r = H(x_1).
4. Compute e = H(hcert, m).
5. Compute w = r ⊕ e and convert w to an integer w̄.
6. If w̄ ≥ n then w̄ ← w̄ - n.
7. Compute s = d(k - w̄) mod n. If s = 0 then go to step 1.
8. Return(r, s).


Algorithm 4.37 EC-KCDSA signature verification 


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), public key Q, hashed certifi-
cation data hcert, message m, signature (r, s).
OUTPUT: Acceptance or rejection of the signature.
1. Verify that the bitlength of r is at most l_H and that s is an integer in the interval
   [1, n-1]. If any verification fails then return(“Reject the signature”).
2. Compute e = H(hcert, m).
3. Compute w = r ⊕ e and convert w to an integer w̄.
4. If w̄ ≥ n then w̄ ← w̄ - n.
5. Compute X = sQ + w̄P.
6. Compute v = H(x_1) where x_1 is the x-coordinate of X.
7. If v = r then return(“Accept the signature”);
   Else return(“Reject the signature”).


Proof that signature verification works  If a signature (r, s) on a message m was
indeed generated by the legitimate signer, then s ≡ d(k - w̄) (mod n). Rearranging
gives k ≡ sd^{-1} + w̄ (mod n). Thus X = sQ + w̄P = (sd^{-1} + w̄)P = kP, and so v = r
as required.
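
The inversion-free structure of EC-KCDSA can be seen in a small Python sketch of Algorithms 4.36 and 4.37. It rests on several assumptions made only for illustration: the toy curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of prime order 19, a 4-bit hash obtained by truncating SHA-256, a one-byte encoding of x-coordinates, and simple concatenation for H(hcert, m). None of these choices comes from the standard, and the helper names are hypothetical.

```python
import hashlib
import secrets

p, a = 17, 2                 # toy curve, illustration only
P, n = (5, 1), 19            # base point of prime order n
L_H = 4                      # assumed hash bitlength l_H (below the bitlength of n)

def add(A, B):
    """Affine addition on the toy curve; None is the point at infinity."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def mul(k, A):
    """Double-and-add scalar multiplication."""
    result = None
    while k > 0:
        if k & 1:
            result = add(result, A)
        A, k = add(A, A), k >> 1
    return result

def H(data):
    """Top l_H bits of SHA-256 (an assumed stand-in for H)."""
    return hashlib.sha256(data).digest()[0] >> (8 - L_H)

def keygen():
    d = secrets.randbelow(n - 1) + 1
    return d, mul(pow(d, -1, n), P)       # Q = d^{-1} P, inverted once at key generation

def reduce_w(r, hcert, m):
    w = r ^ H(hcert + m)                  # w = r XOR e
    return w - n if w >= n else w

def sign(d, hcert, m):
    while True:
        k = secrets.randbelow(n - 1) + 1
        r = H(bytes([mul(k, P)[0]]))
        s = d * (k - reduce_w(r, hcert, m)) % n    # no modular inversion needed
        if s != 0:
            return r, s

def verify(Q, hcert, m, sig):
    r, s = sig
    if not (0 <= r < 2 ** L_H and 1 <= s < n):
        return False
    X = add(mul(s, Q), mul(reduce_w(r, hcert, m), P))
    return X is not None and H(bytes([X[0]])) == r

d, Q = keygen()
sig = sign(d, b"cert", b"message")
print(verify(Q, b"cert", b"message", sig))
```

The only inversion in the sketch is in keygen, which matches the design rationale stated above for defining Q = d^{-1}P.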














Note 4.38 (use of hcert) In practice, hcert can be defined to be the hash of the signer’s 
public-key certificate that should include the signer’s identity, domain parameters, and 
public key. Prepending hcert to the message m prior to hashing (i.e., when computing 
e = H(hcert,m)) can provide resistance to attacks based on manipulation of domain 
parameters. 


Note 4.39 (security proofs for EC-KCDSA) KCDSA, which operates in a prime-order 
subgroup S of the multiplicative group of a finite field, has been proven GMR-secure 
under the assumptions that the discrete logarithm problem in S is intractable and that 
the hash function H is a random function. Actually, if different hash functions H_1
and H_2 are used in steps 3 and 4, respectively, of the signature generation procedure,
then the security proof assumes that H_1 is a random function and makes the weaker
assumption that H_2 is collision resistant.

A security proof for a protocol that makes the assumption that hash functions em- 
ployed are random functions is said to hold in the random oracle model. Such proofs 
do not imply that the protocol is secure in the real world where the hash function is not 
a random function. Nonetheless, such security proofs do offer the assurance that the 
protocol is secure unless an adversary can exploit properties of the hash functions that 
distinguish them from random functions. 

The security proof for KCDSA extends to the case of EC-KCDSA if the operation
in step 3 of signature generation is replaced by r = H(x_1, y_1).


4.5 Public-key encryption 


Public-key encryption schemes can be used to provide confidentiality. Since they are 
considerably slower than their symmetric-key counterparts, they are typically used only 
to encrypt small data items such as credit card numbers and PINs, and to transport 
session keys which are subsequently used with faster symmetric-key algorithms for 
bulk encryption or message authentication. 


Definition 4.40 A public-key encryption scheme consists of four algorithms: 


1. A domain parameter generation algorithm that generates a set D of domain 
parameters. 


2. A key generation algorithm that takes as input a set D of domain parameters and 
generates key pairs (Q,d). 


3. An encryption algorithm that takes as input a set of domain parameters D, a 
public key Q, a plaintext message m, and produces a ciphertext c. 


4. A decryption algorithm that takes as input the domain parameters D, a private 
key d, a ciphertext c, and either rejects c as invalid or produces a plaintext m. 


We assume D is valid (see §4.2) and that Q is valid and associated with D (see §4.3). 
The decryption algorithm always accepts (D,d,c) and outputs m if c was indeed 
generated by the encryption algorithm on input (D, Q,m). 


The following notion of security of a public-key encryption scheme is due to 
Goldwasser, Micali, Rackoff and Simon. 


Definition 4.41 A public-key encryption scheme is said to be secure if it is indis- 
tinguishable by a computationally bounded adversary who can mount an adaptive 
chosen-ciphertext attack. In other words, an adversary who selects two plaintext mes-
sages m_1 and m_2 (of the same length) and is then given the ciphertext c of one of them
is unable to decide with non-negligible advantage whether c is the encryption of m_1
or m_2. This is true even though the adversary is able to obtain the decryptions of any
ciphertexts (different from the target ciphertext c) of its choosing.


This security definition is a very strong one—the adversary is unable to do better 
than guess whether c is the encryption of one of two plaintext messages m_1 and m_2
that the adversary itself chose even when it has access to a decryption oracle. Indis- 
tinguishability against adaptive chosen-ciphertext attacks has gained acceptance as the 
“right” notion of security for public-key encryption schemes. 

Another desirable security property is that it should be infeasible for an adversary 
who is given a valid ciphertext c to produce a different valid ciphertext c’ such that the 
(unknown) plaintext messages m and m’ are related in some known way; this security 
property is called non-malleability. It has been proven that a public-key encryption 
scheme is indistinguishable against adaptive chosen-ciphertext attacks if and only if it 
is non-malleable against adaptive chosen-ciphertext attacks. 


4.5.1 ECIES 


The Elliptic Curve Integrated Encryption Scheme (ECIES) was proposed by Bellare 
and Rogaway, and is a variant of the ElGamal public-key encryption scheme. It has 
been standardized in ANSI X9.63 and ISO/IEC 15946-3, and is in the IEEE P1363a 
draft standard. 

In ECIES, a Diffie-Hellman shared secret is used to derive two symmetric keys k_1
and k_2. Key k_1 is used to encrypt the plaintext using a symmetric-key cipher, while key
k_2 is used to authenticate the resulting ciphertext. Intuitively, the authentication guards
against chosen-ciphertext attacks since the adversary cannot generate valid ciphertexts 
on her own. The following cryptographic primitives are used: 

1. KDF is a key derivation function that is constructed from a hash function H.
If a key of l bits is required then KDF(S) is defined to be the concatenation of
the hash values H(S, i), where i is a counter that is incremented for each hash
function evaluation until l bits of hash values have been generated.

2. ENC is the encryption function for a symmetric-key encryption scheme such as 
the AES, and DEC is the decryption function. 

3. MAC is a message authentication code algorithm such as HMAC. 
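
The counter-based KDF in item 1 can be sketched as follows. The text fixes only the general shape (concatenate H(S, i) until enough bits exist), so the details here are assumptions made for illustration: SHA-256 as H, a 4-byte big-endian counter starting at 1 appended to S, and truncation to whole bytes. The function name kdf and the sample input are hypothetical.

```python
import hashlib

def kdf(shared: bytes, length_bits: int) -> bytes:
    """Concatenate H(S, i) for i = 1, 2, ... until length_bits bits are available."""
    out = b""
    counter = 1
    while 8 * len(out) < length_bits:
        # Assumed encoding of (S, i): S followed by a 4-byte big-endian counter.
        out += hashlib.sha256(shared + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[: (length_bits + 7) // 8]

key_material = kdf(b"example shared secret", 384)
k1, k2 = key_material[:16], key_material[16:]     # e.g. a cipher key and a MAC key
print(len(k1), len(k2))
```

Splitting one KDF output into k1 and k2 mirrors how ECIES obtains its encryption and authentication keys from a single shared secret.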


Algorithm 4.42 ECIES encryption 


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), public key Q, plaintext m.
OUTPUT: Ciphertext (R, C, t).

1. Select k ∈_R [1, n-1].

2. Compute R = kP and Z = hkQ. If Z = ∞ then go to step 1.

3. (k_1, k_2) ← KDF(x_Z, R), where x_Z is the x-coordinate of Z.

4. Compute C = ENC_{k_1}(m) and t = MAC_{k_2}(C).

5. Return(R, C, t).


Algorithm 4.43 ECIES decryption


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), private key d, ciphertext
(R, C, t).
OUTPUT: Plaintext m or rejection of the ciphertext.
1. Perform an embedded public key validation of R (Algorithm 4.26). If the
   validation fails then return(“Reject the ciphertext”).
2. Compute Z = hdR. If Z = ∞ then return(“Reject the ciphertext”).
3. (k_1, k_2) ← KDF(x_Z, R), where x_Z is the x-coordinate of Z.
4. Compute t' = MAC_{k_2}(C). If t' ≠ t then return(“Reject the ciphertext”).
5. Compute m = DEC_{k_1}(C).
6. Return(m).


Proof that decryption works If ciphertext (R,C,t) was indeed generated by the 
legitimate entity when encrypting m, then 


$$hdR = hd(kP) = hk(dP) = hkQ.$$


Thus the decryptor computes the same keys (k_1, k_2) as the encryptor, accepts the ci-
phertext, and recovers m.
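
The encrypt-then-MAC flow of Algorithms 4.42 and 4.43 can be traced with the following Python sketch. It is a sketch under heavy assumptions: the toy curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of order 19 and cofactor 1, a SHA-256 keystream XOR standing in for ENC and DEC (the text suggests a cipher such as the AES), HMAC-SHA-256 as MAC, an ad hoc byte encoding for the KDF inputs, and embedded key validation reduced to a range and on-curve check. All helper names are hypothetical.

```python
import hashlib
import hmac
import secrets

p, a, b = 17, 2, 2                    # toy curve, illustration only
P, n, h = (5, 1), 19, 1

def add(A, B):
    """Affine addition on the toy curve; None is the point at infinity."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def mul(k, A):
    """Double-and-add scalar multiplication."""
    result = None
    while k > 0:
        if k & 1:
            result = add(result, A)
        A, k = add(A, A), k >> 1
    return result

def kdf(x_z, R):
    """Derive (k1, k2) from x_Z and R; the encoding is an assumption."""
    seed = bytes([x_z, R[0], R[1]])
    return hashlib.sha256(seed + b"\x01").digest(), hashlib.sha256(seed + b"\x02").digest()

def stream_xor(key, data):
    """Stand-in for ENC/DEC: XOR with a SHA-256 counter keystream (not the AES)."""
    stream, counter = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(x ^ y for x, y in zip(data, stream))

def valid_point(R):
    """Simplified embedded validation: not infinity, in range, on the curve."""
    if R is None or not all(0 <= c < p for c in R):
        return False
    return (R[1] ** 2 - R[0] ** 3 - a * R[0] - b) % p == 0

def encrypt(Q, m):
    while True:
        k = secrets.randbelow(n - 1) + 1
        R, Z = mul(k, P), mul(h * k, Q)
        if Z is not None:
            break
    k1, k2 = kdf(Z[0], R)
    C = stream_xor(k1, m)
    return R, C, hmac.new(k2, C, hashlib.sha256).digest()

def decrypt(d, ciphertext):
    R, C, t = ciphertext
    if not valid_point(R):
        return None                                   # reject
    Z = mul(h * d, R)
    if Z is None:
        return None                                   # reject
    k1, k2 = kdf(Z[0], R)
    if not hmac.compare_digest(t, hmac.new(k2, C, hashlib.sha256).digest()):
        return None                                   # reject
    return stream_xor(k1, C)

d = 7                                                 # hypothetical private key
Q = mul(d, P)
ct = encrypt(Q, b"attack at dawn")
print(decrypt(d, ct))
```

Rejection is signalled here by returning None; the tag is checked before any decryption takes place, as in step 4 of Algorithm 4.43.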














Security notes 


Note 4.44 (security proofs for ECIES) ECIES has been proven secure (in the sense of 
Definition 4.41) under the assumptions that the symmetric-key encryption scheme and 
MAC algorithm are secure, and that certain non-standard (but reasonable) variants of 
the computational and decision Diffie-Hellman problems are intractable. These Diffie- 
Hellman problems involve the key derivation function KDF. 


Note 4.45 (public key validation) The shared secret point Z = hdR is obtained by
multiplying the Diffie-Hellman shared secret dkP by h. This ensures that Z is a point in
the subgroup ⟨P⟩. Checking that Z ≠ ∞ in step 2 of the decryption procedure confirms
that Z has order exactly n. This, together with embedded key validation performed in
step 1, provides resistance to the small subgroup and invalid-curve attacks described in 
§4.3 whereby an attacker learns information about the receiver’s private key by sending 
invalid points R. 


Note 4.46 (inputs to the key derivation function) The symmetric keys k_1 and k_2 are de-
rived from the x-coordinate x_Z of the Diffie-Hellman shared secret Z as well as the
one-time public key R of the sender. Inclusion of R as input to KDF is necessary
because otherwise the scheme is malleable and hence also not indistinguishable. An
adversary could simply replace R in the ciphertext (R, C, t) by -R, thus obtaining
another valid ciphertext with the same plaintext as the original ciphertext, since
hd(-R) = -Z has the same x-coordinate as Z.


4.5.2 PSEC


Provably Secure Encryption Curve scheme (PSEC) is due to Fujisaki and Okamoto. 
The version we present here is derived by combining PSEC-KEM, a key encapsula- 
tion mechanism, and DEM1, a data encapsulation mechanism, that are described in 
the ISO 18033-2 draft standard. PSEC-KEM has also been evaluated by NESSIE and 
CRYPTREC. 

The following cryptographic primitives are used in PSEC: 


1. KDF is a key derivation function that is constructed from a hash function. 


2. ENC is the encryption function for a symmetric-key encryption scheme such as 
the AES, and DEC is the decryption function. 


3. MAC is a message authentication code algorithm such as HMAC. 


Algorithm 4.47 PSEC encryption 


INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), public key Q, plaintext m.
OUTPUT: Ciphertext (R, C, s, t).

1. Select r ∈_R {0,1}^l, where l is the bitlength of n.

2. (k', k_1, k_2) ← KDF(r), where k' has bitlength l + 128.

3. Compute k = k' mod n.

4. Compute R = kP and Z = kQ.

5. Compute s = r ⊕ KDF(R, Z).

6. Compute C = ENC_{k_1}(m) and t = MAC_{k_2}(C).

7. Return(R, C, s, t).


Algorithm 4.48 PSEC decryption 
INPUT: Domain parameters D = (q, FR, S, a, b, P, n, h), private key d, ciphertext
(R, C, s, t).
OUTPUT: Plaintext m or rejection of the ciphertext.
1. Compute Z = dR.
2. Compute r = s ⊕ KDF(R, Z).
3. (k', k_1, k_2) ← KDF(r), where k' has bitlength l + 128.
4. Compute k = k' mod n.
5. Compute R' = kP.
6. If R' ≠ R then return(“Reject the ciphertext”).
7. Compute t' = MAC_{k_2}(C). If t' ≠ t then return(“Reject the ciphertext”).
8. Compute m = DEC_{k_1}(C).
9. Return(m).


Proof that decryption works  If ciphertext (R, C, s, t) was indeed generated by the
legitimate entity when encrypting m, then dR = d(kP) = k(dP) = kQ. Thus the de-
cryptor recovers the same r, computes the same keys (k', k_1, k_2) as the encryptor,
accepts the ciphertext, and recovers m.














Note 4.49 (security proofs for PSEC) PSEC has been proven secure (in the sense of 
Definition 4.41) under the assumptions that the symmetric-key encryption and MAC 
algorithms are secure, the computational Diffie-Hellman problem is intractable, and the
key derivation function is a random function. 


4.6 Key establishment 


The purpose of a key establishment protocol is to provide two or more entities commu- 
nicating over an open network with a shared secret key. The key may then be used in a 
symmetric-key protocol to achieve some cryptographic goal such as confidentiality or 
data integrity. 

A key transport protocol is a key establishment protocol where one entity creates 
the secret key and securely transfers it to the others. ECIES (see §4.5.1) can be con- 
sidered to be a two-party key transport protocol when the plaintext message consists 
of the secret key. A key agreement protocol is a key establishment protocol where all 
participating entities contribute information which is used to derive the shared secret 
key. In this section, we will consider two-party key agreement protocols derived from 
the basic Diffie-Hellman protocol. 


Security definition A key establishment protocol should ideally result in the sharing 
of secret keys that have the same attributes as keys that were established by people who 
know each other and meet in a secure location to select a key by repeatedly tossing a 
fair coin. In particular, subsequent use of the secret keys in a cryptographic protocol 
should not in any way reduce the security of that protocol. This notion of security has 
proved very difficult to formalize. Instead of a formal definition, we present an informal 
list of desirable security properties of a key establishment protocol. 


Attack model A secure protocol should be able to withstand both passive attacks 
where an adversary attempts to prevent a protocol from achieving its goals by merely 
observing honest entities carrying out the protocol, and active attacks where an ad- 
versary additionally subverts the communications by injecting, deleting, altering or 
replaying messages. In order to limit the amount of data available for cryptanalytic at- 
tack (e.g., ciphertext generated using a fixed session key in an encryption application), 
each run of a key establishment protocol between two entities A and B should produce 
a unique secret key called a session key. The protocol should still achieve its goal in the 
face of an adversary who has learned some other session keys.


Fundamental security goal  The fundamental security goals of a key establishment
protocol are:


1. Implicit key authentication. A key establishment protocol is said to provide im- 
plicit key authentication (of B to A) if entity A is assured that no other entity 
aside from a specifically identified second entity B can possibly learn the value 
of a particular session key. The property does not imply that A is assured of B 
actually possessing the key. 


2. Explicit key authentication. A key establishment protocol is said to provide key 
confirmation (of B to A) if entity A is assured that the second entity B can 
compute or has actually computed the session key. If both implicit key authenti- 
cation and key confirmation (of B to A) are provided, then the key establishment 
protocol is said to provide explicit key authentication (of B to A). 


Explicit key authentication of both entities normally requires three passes (messages 
exchanged). For a two-party three-pass key agreement protocol, the main security goal 
is explicit key authentication of each entity to the other. 


Other desirable security attributes Other security attributes may also be desirable 
depending on the application in which a key establishment protocol is employed. 


1. Forward secrecy. If long-term private keys of one or more entities are compro- 
mised, the secrecy of previous session keys established by honest entities should 
not be affected. 


2. Key-compromise impersonation resilience. Suppose A’s long-term private key is 
disclosed. Clearly an adversary who knows this value can now impersonate A, 
since it is precisely this value that identifies A. However, it may be desirable that 
this loss does not enable an adversary to impersonate other entities to A. 


3. Unknown key-share resilience. Entity A cannot be coerced into sharing a key 
with entity B without A’s knowledge, that is, when A believes the key is shared 
with some entity C ≠ B, and B (correctly) believes the key is shared with A.


We present two elliptic curve-based key agreement schemes, the STS protocol in 
§4.6.1 and ECMQV in §4.6.2. Both these protocols are believed to provide explicit key 
authentication and possess the security attributes of forward secrecy, key-compromise 
impersonation resilience, and unknown key-share resilience. 


4.6.1 Station-to-station 


The station-to-station (STS) protocol is a discrete logarithm-based key agreement 
scheme due to Diffie, van Oorschot and Wiener. We present its elliptic curve analogue 
as described in the ANSI X9.63 standard. 

In the following, D = (q, FR, S, a, b, P, n, h) are elliptic curve domain parameters,
KDF is a key derivation function (see §4.5.1), MAC is a message authentication code
algorithm such as HMAC, and SIGN is the signature generation algorithm for a signa-
ture scheme with appendix such as ECDSA (see §4.4.1) or an RSA signature scheme. If
any verification in Protocol 4.50 fails, then the protocol run is terminated with failure.


Protocol 4.50 Station-to-station key agreement 


GOAL: A and B establish a shared secret key. 


PROTOCOL MESSAGES:
A → B: A, R_A
A ← B: B, R_B, s_B = SIGN_B(R_B, R_A, A), t_B = MAC_{k_1}(R_B, R_A, A)
A → B: s_A = SIGN_A(R_A, R_B, B), t_A = MAC_{k_1}(R_A, R_B, B)
1. A selects k_A ∈_R [1, n-1], computes R_A = k_A P, and sends A, R_A to B.
2. B does the following:
   2.1 Perform an embedded public key validation of R_A (see Algorithm 4.26).
   2.2 Select k_B ∈_R [1, n-1] and compute R_B = k_B P.
   2.3 Compute Z = h k_B R_A and verify that Z ≠ ∞.
   2.4 (k_1, k_2) ← KDF(x_Z), where x_Z is the x-coordinate of Z.
   2.5 Compute s_B = SIGN_B(R_B, R_A, A) and t_B = MAC_{k_1}(R_B, R_A, A).
   2.6 Send B, R_B, s_B, t_B to A.
3. A does the following:
   3.1 Perform an embedded public key validation of R_B (see Algorithm 4.26).
   3.2 Compute Z = h k_A R_B and verify that Z ≠ ∞.
   3.3 (k_1, k_2) ← KDF(x_Z), where x_Z is the x-coordinate of Z.
   3.4 Verify that s_B is B’s signature on the message (R_B, R_A, A).
   3.5 Compute t = MAC_{k_1}(R_B, R_A, A) and verify that t = t_B.
   3.6 Compute s_A = SIGN_A(R_A, R_B, B) and t_A = MAC_{k_1}(R_A, R_B, B).
   3.7 Send s_A, t_A to B.
4. B does the following:
   4.1 Verify that s_A is A’s signature on the message (R_A, R_B, B).
   4.2 Compute t = MAC_{k_1}(R_A, R_B, B) and verify that t = t_A.
5. The session key is k_2.


The shared secret is Z = h k_A k_B P, which is derived from the ephemeral (one-
time) public keys R_A and R_B. Multiplication by h and the check Z ≠ ∞ ensure
that Z has order n and therefore is in ⟨P⟩. Successful verification of the signatures
s_A = SIGN_A(R_A, R_B, B) and s_B = SIGN_B(R_B, R_A, A) convinces each entity of the
identity of the other entity (since the signing entity can be identified by its public sign-
ing key), that the communications have not been tampered with (assuming that the
signature scheme is secure), and that the other entity knows the identity of the entity
with which it is communicating (since this identity is included in the signed message).
Successful verification of the authentication tags t_A and t_B convinces each entity that
the other entity has indeed computed the shared secret Z (since computing the tags
requires knowledge of k_1 and therefore also of Z).


4.6.2 ECMQV


ECMQV is a three-pass key agreement protocol that has been standardized in
ANSI X9.63, IEEE 1363-2000, and ISO/IEC 15946-3.

In the following, D = (q, FR, S, a, b, P, n, h) are elliptic curve domain parameters,
(Q_A, d_A) is A’s key pair, (Q_B, d_B) is B’s key pair, KDF is a key derivation function
(see §4.5.1), and MAC is a message authentication code algorithm such as HMAC. If
R is an elliptic curve point then R̄ is defined to be the integer (x̄ mod 2^{⌈f/2⌉}) + 2^{⌈f/2⌉},
where x̄ is the integer representation of the x-coordinate of R, and f = ⌊log_2 n⌋ + 1
is the bitlength of n. If any verification in Protocol 4.51 fails, then the protocol run is
terminated with failure.


Protocol 4.51 ECMQV key agreement 


GOAL: A and B establish a shared secret key. 


PROTOCOL MESSAGES:
A → B: A, $R_A$
A ← B: B, $R_B$, $t_B = \mathrm{MAC}_{k_1}(2, B, A, R_B, R_A)$
A → B: $t_A = \mathrm{MAC}_{k_1}(3, A, B, R_A, R_B)$
1. A selects $k_A \in_R [1, n-1]$, computes $R_A = k_A P$, and sends A, $R_A$ to B.
2. B does the following:
2.1 Perform an embedded public key validation of $R_A$ (see Algorithm 4.26).
2.2 Select $k_B \in_R [1, n-1]$ and compute $R_B = k_B P$.
2.3 Compute $s_B = (k_B + \overline{R}_B d_B) \bmod n$ and $Z = h s_B (R_A + \overline{R}_A Q_A)$,
and verify that $Z \neq \infty$.
2.4 $(k_1, k_2) \leftarrow \mathrm{KDF}(x_Z)$, where $x_Z$ is the x-coordinate of $Z$.
2.5 Compute $t_B = \mathrm{MAC}_{k_1}(2, B, A, R_B, R_A)$.
2.6 Send B, $R_B$, $t_B$ to A.
3. A does the following:
3.1 Perform an embedded public key validation of $R_B$ (see Algorithm 4.26).
3.2 Compute $s_A = (k_A + \overline{R}_A d_A) \bmod n$ and $Z = h s_A (R_B + \overline{R}_B Q_B)$,
and verify that $Z \neq \infty$.
3.3 $(k_1, k_2) \leftarrow \mathrm{KDF}(x_Z)$, where $x_Z$ is the x-coordinate of $Z$.
3.4 Compute $t = \mathrm{MAC}_{k_1}(2, B, A, R_B, R_A)$ and verify that $t = t_B$.
3.5 Compute $t_A = \mathrm{MAC}_{k_1}(3, A, B, R_A, R_B)$ and send $t_A$ to B.
4. B computes $t = \mathrm{MAC}_{k_1}(3, A, B, R_A, R_B)$ and verifies that $t = t_A$.
5. The session key is $k_2$.
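
The core computation of the ECMQV protocol, namely the implicit signatures and the shared point Z, can be sketched in Python as follows. This is only an illustration under stated assumptions, not a standards-conformant implementation: the toy curve y^2 = x^3 + 2x + 2 over F_17 with base point P = (5, 1) of prime order n = 19 and cofactor h = 1 is chosen purely so the arithmetic is easy to follow, the helper names (`add`, `mul`, `bar`, `mqv_secret`) are hypothetical, and public key validation, the KDF, and the MAC tags are omitted.

```python
# Toy parameters (assumption, far too small for real use):
# E: y^2 = x^3 + 2x + 2 over F_17, base point P of prime order n = 19, cofactor h = 1.
p, a, b = 17, 2, 2
P = (5, 1)
n, h = 19, 1
INF = None  # the point at infinity


def add(P1, P2):
    """Add two points of E(F_p) using the affine group law."""
    if P1 is INF:
        return P2
    if P2 is INF:
        return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF  # P2 = -P1
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def mul(k, Q):
    """Compute kQ by binary double-and-add (not constant time)."""
    R = INF
    while k > 0:
        if k & 1:
            R = add(R, Q)
        Q = add(Q, Q)
        k >>= 1
    return R


def bar(R):
    """Map R to (x mod 2^ceil(f/2)) + 2^ceil(f/2), with f the bitlength of n."""
    half = (n.bit_length() + 1) // 2
    return (R[0] % (1 << half)) + (1 << half)


def mqv_secret(k_own, d_own, R_own, R_peer, Q_peer):
    """Return Z = h * s * (R_peer + bar(R_peer) * Q_peer); None means Z is infinity."""
    s = (k_own + bar(R_own) * d_own) % n  # implicit signature on R_own
    return mul(h * s, add(R_peer, mul(bar(R_peer), Q_peer)))


# Long-term key pairs (d, Q) and ephemeral key pairs (k, R) for A and B.
d_A, d_B = 7, 11
Q_A, Q_B = mul(d_A, P), mul(d_B, P)
k_A, k_B = 3, 5
R_A, R_B = mul(k_A, P), mul(k_B, P)

Z_A = mqv_secret(k_A, d_A, R_A, R_B, Q_B)  # computed by A
Z_B = mqv_secret(k_B, d_B, R_B, R_A, Q_A)  # computed by B
print(Z_A == Z_B)  # True
```

Both parties arrive at the same point because each side multiplies its own implicit signature by a point that equals the other side's implicit signature times P. A real implementation would reject the run when Z is the point at infinity and would then feed the x-coordinate of Z to the KDF.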


Protocol 4.51 can be viewed as an extension of the ordinary Diffie-Hellman key 
agreement protocol. The quantity 


$s_A = (k_A + \overline{R}_A d_A) \bmod n$


serves as an implicit signature for A’s ephemeral public key $R_A$. It is a ‘signature’ in
the sense that the only person who can compute $s_A$ is A, and is ‘implicit’ in the sense
that B indirectly verifies its validity by using


$s_A P = R_A + \overline{R}_A Q_A$


when deriving the shared secret $Z$. Similarly, $s_B$ is an implicit signature for B’s
ephemeral public key $R_B$. The shared secret is $Z = h s_A s_B P$ rather than $k_A k_B P$ as
would be the case with ordinary Diffie-Hellman; multiplication by $h$ and the check
$Z \neq \infty$ ensure that $Z$ has order $n$ and therefore is in $\langle P \rangle$. Note that
$Z$ is derived using the ephemeral public keys ($R_A$ and $R_B$) as well as the long-term
public keys ($Q_A$ and $Q_B$) of the two entities. The strings “2” and “3” are included in the
MAC inputs in order to distinguish authentication tags created by the initiator A and
responder B. Successful verification of the authentication tags $t_A$ and $t_B$ convinces each
entity that the other entity has indeed computed the shared secret $Z$ (since computing the
tags requires knowledge of $k_1$ and therefore also of $Z$), that the communications have not
been tampered with (assuming that the MAC is secure), and that the other entity knows
the identity of the entity with which it is communicating (since the identities are
included in the messages that are MACed). No formal proof of security is known for
Protocol 4.51.
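
The role of the “2” and “3” labels in the key-confirmation tags can be illustrated with a short Python sketch. The choice of HMAC-SHA-256, the length-prefixed field encoding, the helper name `tag`, and the placeholder byte strings are all assumptions made for the example; the protocol description does not fix any of them.

```python
import hashlib
import hmac


def tag(k1: bytes, label: bytes, *fields: bytes) -> bytes:
    """HMAC-SHA-256 over a label and length-prefixed fields (encoding is an assumption)."""
    msg = label + b"".join(len(f).to_bytes(2, "big") + f for f in fields)
    return hmac.new(k1, msg, hashlib.sha256).digest()


k1 = b"\x01" * 32                            # stand-in for the MAC key k_1 from the KDF
A, B = b"Alice", b"Bob"                      # identities
R_A, R_B = b"encoded R_A", b"encoded R_B"    # placeholder encodings of the ephemeral keys

t_B = tag(k1, b"2", B, A, R_B, R_A)          # responder's tag
t_A = tag(k1, b"3", A, B, R_A, R_B)          # initiator's tag

# Each side recomputes the tag it expects and compares in constant time.
print(hmac.compare_digest(t_B, tag(k1, b"2", B, A, R_B, R_A)))  # True
print(t_A != t_B)  # True
```

Because the label and the order of the identities differ, a tag produced by the responder cannot simply be reflected back as the initiator's tag.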


4.7 Notes and further references 


§4.1 

The generic group model for proving lower bounds on the discrete logarithm problem 
was developed by Nechaev [344] and Shoup [425]. The Pohlig-Hellman algorithm is 
due to Pohlig and Hellman [376]. 


Although the ECDLP appears to be difficult to solve on classical computers, it is known 
to be easily solvable on quantum computers (computational devices that exploit quan- 
tum mechanical principles). In 1994, Shor [424] presented polynomial-time algorithms 
for computing discrete logarithms and factoring integers on a quantum computer. The 
ECDLP case was studied more extensively by Proos and Zalka [384] who devised 
quantum circuits for performing the elliptic curve group law. Proos and Zalka showed 
that a k-bit instance of the ECDLP can be efficiently solved on a K-qubit quantum 
computer where $K \approx 5k + 8\sqrt{k} + 5\log_2 k$ (a qubit is the quantum computer analogue
of a classical bit). In contrast, Beauregard [31] showed that k-bit integers can be efficiently
factored on a K-qubit quantum computer where $K \approx 2k$. For example, 256-bit
instances of the ECDLP are roughly equally difficult to solve on classical computers 
as 3072-bit instances of the integer factorization problem. However, the former can 
be solved on a 1448-qubit quantum computer, while the latter seems to need a 6144- 
qubit quantum computer. Thus, it would appear that larger quantum machines (which 
presumably are more difficult to build) are needed to solve the integer factorization 
problem than the ECDLP for problem instances that are roughly equally difficult to 
solve on classical computers. The interesting question then is when or whether
large-scale quantum computers can actually be built. This is an area of very active research
and much speculation. The most significant experimental result achieved thus far is the
7-qubit machine built by Vandersypen et al. [464] in 2001 that was used to factor the
integer 15 using Shor’s algorithm. It remains to be seen whether experiments such as this
can be scaled to factor integers and solve ECDLP instances that are of cryptographic
interest. The book by Nielsen and Chuang [347] is an excellent and extensive overview
of the field of quantum computing.
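
The two qubit counts quoted in this paragraph follow from the estimates K ≈ 5k + 8√k + 5 log2 k (ECDLP) and K ≈ 2k (factoring). A small Python check, with hypothetical function names and rounding to the nearest integer as an assumption:

```python
import math


def ecdlp_qubits(k: int) -> int:
    """Proos-Zalka estimate K ~ 5k + 8*sqrt(k) + 5*log2(k), rounded to an integer."""
    return round(5 * k + 8 * math.sqrt(k) + 5 * math.log2(k))


def factoring_qubits(k: int) -> int:
    """Beauregard estimate K ~ 2k."""
    return 2 * k


print(ecdlp_qubits(256))       # 1448
print(factoring_qubits(3072))  # 6144
```

For k = 256 the three terms are 1280, 128 and 40, which sum to the 1448 qubits stated in the text.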


Characteristics of random functions, including the expected tail length and the expected 
cyclic length of sequences obtained from random functions, were studied by Flajolet 
and Odlyzko [143]. The rho algorithm (Algorithm 4.3) for computing discrete loga- 
rithms was invented by Pollard [379]. Pollard’s original algorithm used an iterating 
function with three branches. Teske [458] provided experimental evidence that Pol- 
lard’s iterating function did not have optimal random characteristics, and proposed the 
iterating function used in Algorithm 4.3. Teske [458, 459] gave experimental and theo- 
retical evidence that her iterating function very closely models a random function when 
the number of branches is L = 20. 


Pollard’s rho algorithm can be accelerated by using Brent’s cycle finding algorithm 
[70] instead of Floyd’s algorithm. This yields a reduction in the expected number of
group operations from $3\sqrt{n}$ to approximately $2\sqrt{n}$. A method that is asymptotically
faster but has significant storage requirements was proposed by Sedgewick, Szymanski 
and Yao [419]. 
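
For readers unfamiliar with Brent's cycle finding algorithm, the following Python sketch shows the idea on a generic iterating function. It is a textbook-style formulation rather than the exact variant used in any particular rho implementation, and the example function x -> x^2 + 1 mod 255 is an arbitrary assumption chosen because its orbit is short enough to check by hand.

```python
def brent(f, x0):
    """Return (lam, mu): the cycle length and tail length of x0, f(x0), f(f(x0)), ..."""
    power = lam = 1
    tortoise, hare = x0, f(x0)
    while tortoise != hare:
        if power == lam:      # start a new block whose length is the next power of two
            tortoise = hare
            power *= 2
            lam = 0
        hare = f(hare)
        lam += 1
    # Find the tail length: put hare lam steps ahead, then advance both in lockstep.
    tortoise = hare = x0
    for _ in range(lam):
        hare = f(hare)
    mu = 0
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(hare)
        mu += 1
    return lam, mu


# Orbit of 3: 3, 10, 101, 2, 5, 26, 167, 95, 101, ...  (tail 3, 10; cycle of length 6)
print(brent(lambda x: (x * x + 1) % 255, 3))  # (6, 2)
```

Unlike Floyd's method, only the hare evaluates f during the search for the cycle length; the saved pointer is merely reset at powers of two, which is the source of the reduced operation count.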


The parallelized version of Pollard’s rho algorithm (Algorithm 4.5) is due to van 
Oorschot and Wiener [463]. 


Gallant, Lambert and Vanstone [159] and Wiener and Zuccherato [482] independently 
discovered the methods for speeding (parallelized) Pollard’s rho algorithm using auto- 
morphisms. They also described techniques for detecting when a processor has entered 
a short (and useless) cycle. These methods were generalized to hyperelliptic curves and 
other curves by Duursma, Gaudry and Morain [128]. 


Silverman and Stapleton [434] were the first to observe that the distinguished points 
encountered in Pollard’s rho algorithm during the solution of an ECDLP instance can 
be used in the solution of subsequent ECDLP instances (with the same elliptic curve 
parameters). The use of Pollard’s rho algorithm to iteratively solve multiple ECDLP 
instances was analyzed by Kuhn and Struik [271]. Kuhn and Struik also proved that 
the best strategy for solving any one of k given ECDLP instances is to arbitrarily select 
one of these instances and devote all efforts to solving that instance. 


Pollard’s kangaroo algorithm [379] (introduced under the name lambda method) was
designed to find discrete logarithms that are known to lie in an interval of length $b$. It
has an expected running time of $3.28\sqrt{b}$ group operations and negligible storage
requirements. Van Oorschot and Wiener [463] presented a variant that has modest storage
requirements and an expected running time of approximately $2\sqrt{b}$ group operations.
They also showed how to parallelize the kangaroo method, achieving a speedup that
is linear in the number of processors employed. The parallelized kangaroo method is
slower than the parallelized rho algorithm when no information is known a priori about
the discrete logarithm (i.e., $b = n$). It becomes faster when $b < 0.39n$. The parallelized
kangaroo method was further analyzed by Pollard [381].


The arguments in §4.1.3 for failure of the index-calculus method for the ECDLP were 
presented by Miller [325] and further elaborated by Silverman and Suzuki [432]. For 
an excellent exposition of the failure of this and other attacks, see Koblitz [257]. 


Silverman [431] proposed an attack on the ECDLP that he termed xedni calculus. Given
an ECDLP instance $(P, Q)$ on an elliptic curve $E$ over a prime field $\mathbb{F}_p$, one first takes
$r \le 9$ different integer linear combinations of $P$ and $Q$ and lifts these $r$ points to points
in the rational plane $\mathbb{Q} \times \mathbb{Q}$. One then attempts to find an elliptic curve $E$ defined over
$\mathbb{Q}$ that passes through these points. (This procedure is the reverse of the index-calculus
method which first lifts the curve and then the points; hence the name “xedni”.) If $E(\mathbb{Q})$
has rank $< r$, then an integer linear dependence relation among the $r$ points can be
found thereby (almost certainly) yielding a solution to the original ECDLP. In order to
increase the probability that $E(\mathbb{Q})$ has rank $< r$, Silverman required that $E$ be chosen so
that $\#E(\mathbb{F}_t)$ is as small as possible for all small primes $t$, that is,
$\#E(\mathbb{F}_t) \approx t + 1 - 2\sqrt{t}$.
(The opposite conditions, $\#E(\mathbb{F}_t) \approx t + 1 + 2\sqrt{t}$, called Mestre conditions, were
proposed by Mestre [324] and have been successfully used to obtain elliptic curves over $\mathbb{Q}$
of higher than expected rank.) Shortly after Silverman proposed xedni calculus, Koblitz
(see Appendix K of [431]) observed that xedni calculus could be adapted to solve both
the ordinary discrete logarithm problem and the integer factorization problem. Thus, if
the xedni-calculus attack were efficient, then it would adversely affect the security of
all the important public-key schemes. Fortunately (for proponents of public-key
cryptography), Jacobson, Koblitz, Silverman, Stein and Teske [222] were able to prove that
xedni calculus is ineffective asymptotically (as $p \to \infty$), and also provided convincing
experimental evidence that it is extremely inefficient for primes $p$ of the sizes used in
cryptography.

Isomorphism attacks on prime-field-anomalous elliptic curves were discovered inde- 
pendently by Satoh and Araki [401], Semaev [420] and Smart [438]. Semaev’s attack 
was generalized by Rück [397] to the DLP in subgroups of order p of the jacobian
of an arbitrary curve (including a hyperelliptic curve) defined over a finite field of 
characteristic p. 


The Weil pairing and Tate pairing attacks are due to Menezes, Okamoto and Vanstone 
[314], and Frey and Rück [150], respectively. Balasubramanian and Koblitz [27] proved
that the embedding degree k is large for most elliptic curves of prime order defined
over prime fields. The Tate pairing attack applies to the jacobian of any non-singular
irreducible curve over a finite field $\mathbb{F}_q$ (subject to the condition that the order n of the
base element satisfies $\gcd(n, q) = 1$). Galbraith [155] derived upper bounds k(g) on the
embedding degree k for supersingular abelian varieties of dimension g over finite fields;
these varieties include the jacobians of genus-g supersingular curves. The bounds were
improved by Rubin and Silverberg [396]. Constructive applications of supersingular
curves (and bilinear maps in general) include the three-party one-round Diffie-Hellman 
protocol of Joux [227], the identity-based public-key encryption scheme of Boneh and 
Franklin [58, 59], the hierarchical identity-based encryption and signature schemes of 
Horwitz and Lynn [199] and Gentry and Silverberg [170], the short signature scheme 
of Boneh, Lynn and Shacham [62], the aggregate signature scheme of Boneh, Gentry, 
Lynn and Shacham [60], the self-blindable certificate scheme of Verheul [472], and the 
efficient provably secure signature scheme of Boneh, Mironov and Shoup [63]. 


Frey first presented the Weil descent attack methodology in his lecture at the ECC ’98
conference (see [149]). Frey’s ideas were further elaborated by Galbraith and Smart
[158]. The GHS attack was presented by Gaudry, Hess and Smart [167] (see also Hess
[196]). It was shown to fail for all cryptographically interesting elliptic curves over
$\mathbb{F}_{2^m}$ for all prime $m \in [160, 600]$ by Menezes and Qu [315]. Jacobson, Menezes and
Stein [223] used the GHS attack to solve an actual ECDLP instance over $\mathbb{F}_{2^{124}}$ by
first reducing it to an HCDLP instance in a genus-31 hyperelliptic curve over $\mathbb{F}_{2^4}$,
and then solving the latter with the Enge-Gaudry subexponential-time algorithm [163,
133]. Maurer, Menezes and Teske [304] completed the analysis of the GHS attack
by identifying and enumerating the isomorphism classes of elliptic curves over $\mathbb{F}_{2^m}$
for composite $m \in [160, 600]$ that are most vulnerable to the GHS attack. Menezes,
Teske and Weng [318] showed that the fields $\mathbb{F}_{2^m}$, where $m \in [185, 600]$ is divisible
by 5, are weak for elliptic curve cryptography in the sense that the GHS attack can
be used to solve the ECDLP significantly faster than Pollard’s rho algorithm for all
cryptographically interesting elliptic curves over these fields.


Elliptic curves $E_1$ and $E_2$ defined over $\mathbb{F}_{q^m}$ are said to be isogenous over $\mathbb{F}_{q^m}$ if
$\#E_1(\mathbb{F}_{q^m}) = \#E_2(\mathbb{F}_{q^m})$. Galbraith, Hess and Smart [156] presented a practical
algorithm for explicitly computing an isogeny between two isogenous elliptic curves over
$\mathbb{F}_{2^m}$. They observed that their algorithm could be used to extend the effectiveness of the
GHS attack as follows. Given an ECDLP instance on some cryptographically interesting
elliptic curve $E_1$ over $\mathbb{F}_{2^m}$, one can check if $E_1$ is isogenous to some elliptic curve
$E_2$ over $\mathbb{F}_{2^m}$ for which the GHS reduction yields an easier HCDLP instance than $E_1$.
One can then use an isogeny $\phi : E_1 \to E_2$ to map the ECDLP instance to an ECDLP
instance in $E_2(\mathbb{F}_{2^m})$ and solve the latter using the GHS attack. For example, in the case
$m = 155$, we can expect that roughly $2^{104}$ out of the $2^{156}$ isomorphism classes of elliptic
curves over $\mathbb{F}_{2^{155}}$ are isogenous to one of the approximately $2^{32}$ elliptic curves over
$\mathbb{F}_{2^{155}}$ originally believed to be susceptible to the GHS attack. Thus, the GHS attack
may now be effective on $2^{104}$ out of the $2^{156}$ elliptic curves over $\mathbb{F}_{2^{155}}$.


Arita [18] showed that some elliptic curves over finite fields $\mathbb{F}_{3^m}$ of characteristic three
may also be susceptible to the Weil descent attack. Diem [118, 119] has shown that
the GHS attack can be extended to elliptic curves over $\mathbb{F}_{p^m}$ where $p \ge 5$ is prime.
He concludes that his particular variant of the GHS attack will always fail when $m$ is
prime and $m \ge 11$; that is, the discrete logarithm problem in the resulting higher-genus
curves is intractable. However, he provides some evidence that the attack might succeed
for some elliptic curves when $m = 3$, 5 or 7. Further research and experimentation are
necessary before the cryptographic implications of Diem’s work are fully understood.


Den Boer [112] proved the equivalence of the discrete logarithm and Diffie-Hellman 
problems in arbitrary cyclic groups of order n where $\phi(n)$ has no large prime factors
($\phi(n)$ is the Euler phi function). These results were generalized by Maurer [305]; see
also Maurer and Wolf [307]. Boneh and Lipton [61] formulated problems in generic 
fields (which they call black-box fields), and proved the result that hardness of the 
ECDLP implies hardness of the ECDHP. Boneh and Shparlinski [64] proved that if the 
ECDHP is hard in a prime-order subgroup $\langle P \rangle \subseteq E(\mathbb{F}_p)$ of an elliptic curve E defined
over a prime field $\mathbb{F}_p$, then there does not exist an efficient algorithm that predicts the
least significant bit of either the x-coordinate or the y-coordinate of the Diffie-Hellman 
secret point for most elliptic curves isomorphic to E. This does not exclude the ex- 
istence of efficient prediction algorithms for each of the isomorphic elliptic curves. 
Boneh and Shparlinski’s work provides some evidence that computing the least sig- 
nificant bit of either the x-coordinate or the y-coordinate of the Diffie-Hellman secret 
point $abP$ from $(P, aP, bP)$ is as hard as computing the entire point $abP$.


A comprehensive survey (circa 1998) of the decision Diffie-Hellman problem and its 
cryptographic applications is given by Boneh [54]. Joux and Nguyen [229] (see also 
Verheul [471]) give examples of supersingular elliptic curves for which the discrete 
logarithm and Diffie-Hellman problems are equivalent (and not known to be solvable in 
polynomial time), but for which the decision Diffie-Hellman problem can be efficiently 
solved. 


§4.2 

Algorithms 4.14 and 4.15 (domain parameter generation and validation), and Algo- 
rithms 4.17, 4.18, 4.19 and 4.21 (generation and verification of random elliptic curves 
over prime fields and binary fields) are extracted from ANSI X9.62 [14]. Vaudenay 
[467] studied the procedures for generating random elliptic curves and suggested some 
enhancements. In particular, he proposed including the field order and representation 
as input in the binary field case. 


Lenstra [285] proved that the orders of elliptic curves over a prime field are roughly 
uniformly distributed in the Hasse interval. Howe [201] extended Lenstra’s results to 
obtain, for any finite field $\mathbb{F}_q$ and prime power $\ell^k$, estimates for the probability that a
randomly selected elliptic curve over $\mathbb{F}_q$ has order $\#E(\mathbb{F}_q)$ divisible by $\ell^k$. The
early-abort strategy was first studied by Lercier [287].
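
The Hasse interval [p + 1 - 2√p, p + 1 + 2√p] and the spread of curve orders within it can be explored empirically for a very small prime. The Python sketch below counts points by brute force, which is only feasible for toy field sizes; the function names and the choice p = 23 are assumptions made for illustration, and the tally is an experiment rather than a demonstration of Lenstra's theorem.

```python
import math
from collections import Counter


def curve_order(p: int, a: int, b: int) -> int:
    """Count points on y^2 = x^3 + ax + b over F_p by brute force (including infinity)."""
    # squares[v] is the number of y in F_p with y^2 = v.
    squares = Counter(y * y % p for y in range(p))
    return 1 + sum(squares[(x * x * x + a * x + b) % p] for x in range(p))


def orders(p: int) -> Counter:
    """Tally #E(F_p) over all nonsingular curves y^2 = x^3 + ax + b, for a prime p > 3."""
    tally = Counter()
    for a in range(p):
        for b in range(p):
            if (4 * a**3 + 27 * b**2) % p != 0:  # nonzero discriminant
                tally[curve_order(p, a, b)] += 1
    return tally


p = 23
tally = orders(p)
bound = 2 * math.sqrt(p)
print(all(abs(N - (p + 1)) <= bound for N in tally))  # True
print(sorted(tally.items()))  # (order, number of curves) pairs
```

Every order that occurs lies within the Hasse bound, and the second line of output shows how the curves are distributed over the admissible orders.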


The complex multiplication method for prime fields is described by Atkin and Morain 
[20] (see also Buchmann and Baier [79]), for binary fields by Lay and Zimmer [276], 
and for optimal extension fields by Baier and Buchmann [24]. Weng [479] intro- 
duced a CM method for generating hyperelliptic curves of genus 2 that are suitable 
for cryptographic applications.


Schoof’s algorithm [411], originally described for elliptic curves over finite fields of
odd characteristic, was adapted to the binary field case by Koblitz [252]. An extensive
treatment of Schoof’s algorithm [411] and its improvements by Atkin and Elkies (and
others) is given by Blake, Seroussi and Smart [49, Chapter VII]. Lercier and Morain
[289] and Izu, Kogure, Noro and Yokoyama [218] report on their implementations of
the SEA algorithm for the prime field case. The latter implementation on a 300 MHz
Pentium II counts the number of points on an elliptic curve over a 240-bit prime field
in about 7.5 minutes, and can generate an elliptic curve of prime order over a 240-bit
prime field in about 3 hours. Extensions of Schoof’s algorithm to genus-2 hyperelliptic
curves were studied by Gaudry and Harley [166].


Satoh [400] presented his point counting algorithm for elliptic curves over finite fields 
of small characteristic greater than five. It was extended to elliptic curves over binary 
fields by Fouquet, Gaudry and Harley [146] and Skjernaa [436]. Many variants for 
the binary field case have subsequently been proposed. A variant that has lower mem- 
ory requirements was devised by Vercauteren, Preneel and Vandewalle [470]. Fouquet, 
Gaudry and Harley [147] explore combinations with an early abort strategy for the 
purpose of generating elliptic curves of almost-prime orders. The SST variant was
proposed by Satoh, Skjernaa and Taguchi [402]. The AGM method, developed by Mestre,
Harley and Gaudry, is described by Gaudry [164], who also presents refinements and
comparisons of the AGM and SST algorithms. Gaudry reports that his modified-SST
algorithm can determine the number of points on randomly chosen elliptic curves over
$\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{239}}$ in 0.13 seconds and 0.40 seconds, respectively, on a 700 MHz
Pentium III. Further enhancements for binary fields having a Gaussian normal basis of
small type have been reported by Kim et al. [243], Lercier and Lubicz [288], and Harley 
[192]. 


Another noteworthy algorithm is that of Kedlaya [240] for counting the points on
hyperelliptic curves (of any genus) over finite fields of small odd characteristic. Kedlaya’s
algorithm was extended by Vercauteren [469] to hyperelliptic curves over binary fields,
by Gaudry and Gürel [165] to superelliptic curves $y^r = f(x)$ over finite fields of small
characteristic different from $r$, and by Denef and Vercauteren [113] to Artin-Schreier
curves $y^2 + x^m y = f(x)$ over binary fields.


§4.3 

The need for public key validation was evangelized by Johnson [224, 225] at various 
standards meetings. Small subgroup attacks on discrete logarithm protocols are due 
to Vanstone (as presented by Menezes, Qu and Vanstone [316]), van Oorschot and 
Wiener [462], Anderson and Vaudenay [13], and Lim and Lee [296]. The invalid-curve 
attacks are extensions of the small subgroup attacks to invalid curves, using the ideas 
behind the differential fault attacks on elliptic curve schemes by Biehl, Meyer and 
Müller [46]. Invalid-curve attacks were first described by Antipa, Brown, Menezes,
Struik and Vanstone [16] who also demonstrated their potential effectiveness on the
ECIES encryption scheme and the one-pass ECMQV key agreement protocol.


§4.4

The concept of a signature scheme was introduced in 1976 by Diffie and Hellman [121]. 
The first signature scheme based on the discrete logarithm problem was proposed in 
1984 by ElGamal [131]. There are many variants of ElGamal’s scheme including DSA, 
KCDSA, and schemes proposed by Schnorr [410] and Okamoto [354]. The notion of 
GMR-security (Definition 4.28) is due to Goldwasser, Micali and Rivest [175]. 


ECDSA is described by Johnson, Menezes and Vanstone [226]. An extensive security 
analysis was undertaken by Brown [75] who proved the GMR-security of ECDSA in 
the generic group model. Dent [114] demonstrated that security proofs in the generic 
group model may not provide any assurances in practice by describing a signature 
scheme that is provably secure in the generic group model but is provably insecure 
when any specific group is used. Stern, Pointcheval, Malone-Lee and Smart [452] no- 
ticed that ECDSA has certain properties that no longer hold in the generic group model, 
further illustrating limitations of security proofs in the generic group model. 


Howgrave-Graham and Smart [202] first showed that an adversary can efficiently re- 
cover a DSA or ECDSA private key if she knows a few bits of each per-message secret 
corresponding to some signed messages (see Note 4.34). Their attacks were formally 
proven to work for DSA and ECDSA by Nguyen and Shparlinski [345, 346], and for 
the Nyberg-Rueppel signature scheme by El Mahassni, Nguyen and Shparlinski [130]. 
Römer and Seifert [392] presented a variant of this attack on ECDSA.


EC-KCDSA was first described by Lim and Lee [297]. The description provided in 
§4.4.2 is based on the ISO/IEC 15946-2 standard [212]. The random oracle model 
was popularized by Bellare and Rogaway [37]. Canetti, Goldreich and Halevi [83] 
presented public-key encryption and signature schemes which they proved are secure 
in the random oracle model, but insecure for any concrete instantiation of the random 
function. Their work demonstrates that caution must be exercised when assessing the 
real-world security of protocols that have been proven secure in the random oracle 
model. Pointcheval and Stern [378] and Brickell, Pointcheval, Vaudenay and Yung [73] 
proved the security of several variants of DSA (and also ECDSA) in the random oracle 
model. The security proofs do not appear to extend to DSA and ECDSA. The security 
proof of KCDSA mentioned in Note 4.39 is due to Brickell, Pointcheval, Vaudenay and 
Yung [73]. 


Signature schemes such as ECDSA and EC-KCDSA are sometimes called signature 
schemes with appendix because the message m is a required input to the verification 
process. Signature schemes with (partial) message recovery are different in that they 
do not require the (entire) message as input to the verification algorithm. The message, 
or a portion of it, is recovered from the signature itself. Such schemes are desirable in 
environments where bandwidth is extremely constrained. The Pintsov-Vanstone (PV)
signature scheme [375] is an example of a signature scheme with partial message recovery.
It is based on a signature scheme of Nyberg and Rueppel [350] and was extensively
analyzed by Brown and Johnson [76] who provided security proofs under various
assumptions. Another elliptic curve signature scheme providing partial message recovery
is that of Naccache and Stern [341].


§4.5 

The notion of indistinguishability (also known as polynomial security) for public- 
key encryption schemes (Definition 4.41) was conceived by Goldwasser and Micali 
[174]. They also formalized the security notion of semantic security—where a com- 
putationally bounded adversary is unable to obtain any information about a plaintext 
corresponding to a given ciphertext—and proved that the two security notions are 
equivalent (under chosen-plaintext attacks). The concept of non-malleability was 
introduced by Dolev, Dwork and Naor [123, 124]. Rackoff and Simon [389] are usu- 
ally credited for the requirement that these security properties hold under adaptive 
chosen-ciphertext attacks. Bellare, Desai, Pointcheval and Rogaway [36] studied the 
relationships between various security notions for public-key encryption schemes and 
proved the equivalence of indistinguishability and non-malleability against adaptive 
chosen-ciphertext attacks. 


The security definitions are in the single-user setting where there is only one legitimate 
entity who can decrypt data and the adversary’s goal is to compromise the security of 
this task. Bellare, Boldyreva and Micali [35] presented security definitions for public- 
key encryption in the multi-user setting. The motivation for their work was to account 
for attacks such as Håstad’s attacks [195] whereby an adversary can easily recover
a plaintext m if the same m (or linearly related m) is encrypted for three legitimate
entities using the basic RSA encryption scheme with encryption exponent e = 3. Note
that Håstad’s attacks cannot be considered to defeat the security goals of public-key
encryption in the single-user setting where there is only one legitimate entity. Bellare, 
Boldyreva and Micali proved that security in the single-user setting implies security in 
the multi-user setting. 
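
The simplest instance of the attack mentioned above, where the same plaintext m is encrypted under basic RSA with e = 3 for three recipients, can be sketched in Python. The toy moduli, the plaintext value, and the helper names `crt` and `icbrt` are assumptions chosen for illustration; the sketch does not cover the case of linearly related plaintexts.

```python
from math import prod


def crt(residues, moduli):
    """Chinese remainder theorem for pairwise coprime moduli."""
    N = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Ni = N // m
        x += r * Ni * pow(Ni, -1, m)
    return x % N


def icbrt(v: int) -> int:
    """Integer cube root: the largest r with r**3 <= v, found by binary search."""
    lo, hi = 0, 1
    while hi**3 <= v:
        hi *= 2
    while lo < hi - 1:
        mid = (lo + hi) // 2
        if mid**3 <= v:
            lo = mid
        else:
            hi = mid
    return lo


e = 3
moduli = [11 * 17, 23 * 29, 41 * 47]  # toy RSA moduli of the three recipients
m = 100                               # the common plaintext, smaller than every modulus
ciphertexts = [pow(m, e, N) for N in moduli]

# Since m^3 is smaller than the product of the moduli, the CRT value is m^3 itself.
recovered = icbrt(crt(ciphertexts, moduli))
print(recovered)  # 100
```

No private key is used anywhere: the adversary combines the three ciphertexts and takes an ordinary integer cube root, which is why the attack only makes sense in a setting with several legitimate recipients.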


ECIES, a variant of the ElGamal public-key encryption scheme [131], was proposed by
Bellare and Rogaway [40]. Abdalla, Bellare and Rogaway [1] formulated three variants 
of the computational and decision Diffie-Hellman problems whose intractability was 
sufficient for the security of ECIES. Smart [441] adapted the proof to the generic group 
model where the Diffie-Hellman intractability assumptions are replaced by the assump- 
tion that the group is generic. Cramer and Shoup [106] proved the security of ECIES 
in the random oracle model under the assumption that the ECDHP is hard
even if an efficient algorithm for the ECDDHP is known. Solving the Diffie-Hellman 
problem given an oracle for the decision Diffie-Hellman problem is an example of a 
gap problem, a notion introduced by Okamoto and Pointcheval [356]. 


PSEC is based on the work of Fujisaki and Okamoto [152]. Key encapsulation mecha- 
nisms were studied by Cramer and Shoup [106]. PSEC-KEM, DEM1, and the security 
proof of PSEC were presented by Shoup in ISO 18033-2 [215].


Cramer and Shoup [105] presented a discrete logarithm-based public-key encryption
scheme that is especially notable because it was proven secure in a standard model 
(i.e., not in idealized models such as the generic group or random oracle model). 
The security proof assumes the intractability of the decision Diffie-Hellman problem 
and makes reasonable assumptions about the hash function employed. An extension 
of the scheme for encrypting messages of arbitrary lengths was proved secure by 
Shoup [426] under the computational Diffie-Hellman assumption in the random oracle 
model where the hash function is modeled as a random function. One drawback of the 
Cramer-Shoup scheme is that the encryption and decryption procedures require more 
group exponentiations (point multiplications in the elliptic curve case) than competing 
schemes. 


Some other notable discrete logarithm-based public-key encryption schemes are those 
that can be derived from the general constructions of Pointcheval [377], and Okamoto 
and Pointcheval [357]. These constructions convert any public-key encryption scheme 
that is indistinguishable against passive attacks (such as the basic ElGamal scheme) to
one that is provably indistinguishable against adaptive chosen-ciphertext attacks in the 
random oracle model. 


§4.6

The Diffie-Hellman key agreement protocol was introduced in the landmark paper of 
Diffie and Hellman [121]. Boyd and Mathuria [68] provide a comprehensive and up- 
to-date treatment of key transport and key agreement protocols. See also Chapter 12 
of Menezes, van Oorschot and Vanstone [319], and the survey of authenticated Diffie- 
Hellman protocols by Blake-Wilson and Menezes [50]. 


The most convincing formal definition of a secure key establishment protocol is that of 
Canetti and Krawczyk [84]; see also Canetti and Krawczyk [85]. 


The STS key agreement protocol (Protocol 4.50) is due to Diffie, van Oorschot and 
Wiener [122]. Blake-Wilson and Menezes [51] presented some plausible unknown key- 
share attacks on the STS protocol when the identity of the intended recipient is not 
included in the messages that are signed and MACed. Protocols that are similar (but 
not identical) to Protocol 4.50 were proven secure by Canetti and Krawczyk [84]. 


The ECMQV key agreement protocol (Protocol 4.51) was studied by Law, Menezes, 
Qu, Solinas and Vanstone [275], who provide some heuristic arguments for its security 
and also present a one-pass variant. Kaliski [237] described an unknown key-share 
attack on a two-pass variant of the ECMQV protocol that does not provide key 
confirmation. The three-pass Protocol 4.51 appears to resist this attack. 


Many different authenticated Diffie-Hellman key agreement protocols have been pro- 
posed and analyzed. Some well-known examples are the OAKLEY protocol of Orman 
[363], the SKEME protocol of Krawczyk [269], and the Internet Key Exchange (IKE) 
protocol due to Harkins and Carrell [190] and analyzed by Canetti and Krawczyk [85]. 








CHAPTER 5 


Implementation Issues 


This chapter introduces some engineering aspects of implementing cryptographic so- 
lutions based on elliptic curves efficiently and securely in specific environments. The 
presentation will often be by selected examples, since the material is necessarily 
platform-specific and complicated by competing requirements, physical constraints and 
rapidly changing hardware, inelegant designs, and different objectives. The coverage 
is admittedly narrow. Our goal is to provide a glimpse of engineering considerations 
faced by software developers and hardware designers. The topics and examples chosen 
illustrate general principles or involve hardware or software in wide use. 


Selected topics on efficient software implementation are presented in §5.1. Although 
the coverage is platform-specific (and hence also about hardware), much of the mate- 
rial has wider applicability. The section includes notes on use of floating-point and 
single-instruction multiple-data (vector) operations found on common workstations to 
speed field arithmetic. §5.2 provides an introduction to the hardware implementation 
of finite field and elliptic curve arithmetic. §5.3 on secure implementation introduces 
the broad area of side-channel attacks. Rather than a direct mathematical assault on 
security mechanisms, such attacks attempt to glean secrets from information leaked 
as a consequence of physical processes or implementation decisions, including power 
consumption, electromagnetic radiation, timing of operations, fault analysis, and anal- 
ysis of error messages. In particular, simple and differential power analysis have been 
shown to be effective against devices such as smart cards where power consumption 
can be accurately monitored. For such devices, tamper-proof packaging may be inef- 
fective (or at least expensive) for protecting embedded secrets. The section discusses 
some algorithmic countermeasures which can minimize or mitigate the effectiveness 
of side-channel attacks, typically at the cost of some efficiency.

5.1 Software implementation


This section collects a few topics which involve platform-specific details to a greater 
extent than earlier chapters. At this level, software implementation decisions are driven 
by underlying hardware characteristics, and hence this section is also about hardware. 
No attempt has been made to be comprehensive; rather, the coverage is largely by 
example. For material which focuses on specific platforms, we have chosen the Intel 
IA-32 family (commonly known as x86 processors, in wide use since the 1980s) and 
the Sun SPARC family. 

§5.1.1 discusses some shortcomings of traditional approaches for integer multiplica- 
tion, in particular, on the Intel Pentium family processors. §5.1.2 and §5.1.3 present 
an overview of technologies and implementation issues for two types of hardware 
acceleration. Many common processors possess floating-point hardware that can be 
used to implement prime field arithmetic. A fast method presented by Bernstein us- 
ing floating-point methods is outlined in §5.1.2. §5.1.3 considers the single-instruction 
multiple-data (SIMD) registers present on Intel and AMD processors, which can be 
used to speed field arithmetic. The common MMX subset is suitable for binary field
arithmetic, and extensions on the Pentium 4 can be used to speed multiplication in 
prime fields using integer operations rather than floating point methods. §5.1.4 consists 
of miscellaneous optimization techniques and implementation notes, some of which 
concern requirements, characteristics, flaws, and quirks of the development tools the 
authors have used. Selected timings for field arithmetic are presented in §5.1.5. 


5.1.1 Integer arithmetic 


In “classical” implementations of field arithmetic for $\mathbb{F}_p$ where $p$ is prime, the field
element $a$ is represented as a series of $W$-bit integers $0 \le a_i < 2^W$, where $W$ is the
wordsize on the target machine (e.g., $W = 32$) and $a = \sum_{i=0}^{t-1} a_i 2^{Wi}$. Schoolbook
multiplication uses various scanning methods, of which product scanning (Algorithm 2.10)
consecutively computes each output word of $c = ab$ (and reduction is
done separately). A multiply-and-accumulate strategy with a three-register accumulator
$(r_2, r_1, r_0)$ consists primarily of $t^2$ repeated fragments of the form

$$
\begin{aligned}
(uv) &\leftarrow a_i b_j \\
(\varepsilon, r_0) &\leftarrow r_0 + v \\
(\varepsilon, r_1) &\leftarrow r_1 + u + \varepsilon \\
r_2 &\leftarrow r_2 + \varepsilon
\end{aligned}
\tag{5.1}
$$

where $(uv)$ is the $2W$-bit product of $a_i$ and $b_j$ and $\varepsilon$ is the carry bit. Karatsuba-Ofman
techniques (see §2.2.2) reduce the number of multiplications and are faster asymp- 
totically, but the overhead often makes such methods uncompetitive for field sizes of 
practical interest. 
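
As an illustration of how fragment (5.1) fits into product scanning, the following Python sketch models the three-word accumulator $(r_2, r_1, r_0)$ with $W = 32$. It is a minimal model under stated assumptions rather than a transcription of Algorithm 2.10: the helper names `to_words`, `from_words`, and `product_scan` are hypothetical, and Python integers stand in for machine registers.

```python
W = 32
MASK = (1 << W) - 1

def to_words(n, t):
    """Split a nonnegative integer into t little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(t)]

def from_words(words):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))

def product_scan(a, b):
    """Multiply two t-word operands, producing one output word per column."""
    t = len(a)
    c = [0] * (2 * t)
    r0 = r1 = r2 = 0
    for k in range(2 * t - 1):
        # All index pairs (i, j) with i + j = k contribute to column k.
        for i in range(max(0, k - t + 1), min(k, t - 1) + 1):
            j = k - i
            uv = a[i] * b[j]                  # (uv) <- a_i * b_j
            u, v = uv >> W, uv & MASK
            s = r0 + v                        # (carry, r0) <- r0 + v
            r0, carry = s & MASK, s >> W
            s = r1 + u + carry                # (carry, r1) <- r1 + u + carry
            r1, carry = s & MASK, s >> W
            r2 += carry                       # r2 <- r2 + carry
        c[k] = r0
        r0, r1, r2 = r1, r2, 0                # slide the accumulator one word
    c[2 * t - 1] = r0
    return c
```

For seven-word (224-bit) operands the inner fragment executes $7^2 = 49$ times, which is the multiplication count quoted for the classical method later in this section.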
Processor    Year  MHz  Cache (KB)    Selected features
-----------  ----  ---  ------------  ----------------------------------------------
386          1985  16                 First IA-32 family processor with 32-bit
                                      operations and parallel stages.

486          1989  25   L1: 8         Decode and execution units expanded in five
                                      pipelined stages in the 486; processor is
                                      capable of one instruction per clock cycle.

Pentium      1993                     Dual-pipeline: optimal pairing in U-V pipes
Pentium MMX  1997                     could give throughput of two instructions per
                                      clock cycle. MMX added eight special-purpose
                                      64-bit “multimedia” registers, supporting
                                      operations on vectors of 1, 2, 4, or 8-byte
                                      integers.

Pentium Pro             L1: 16        P6 architecture introduced more sophisticated
                        L2: 256,512   pipelining and out-of-order execution.
Pentium II              L1: 32        Instructions decoded to μops, with up to three
                        L2: 256,512   μops executed per cycle. Improved branch
Celeron                 L2: 0,128     prediction, but misprediction penalty much
Pentium III             L1: 32        larger than on Pentium. Integer multiplication
                        L2: 512       latency/throughput 4/1 vs 9/9 on Pentium.
                                      Pentium II and newer have MMX; the III
                                      introduced SSE extensions with 128-bit
                                      registers supporting operations on vectors of
                                      single-precision floating-point values.

Pentium 4                             NetBurst architecture runs at significantly
                                      higher clock speeds, but many instructions have
                                      worse cycle counts than P6 family processors.
                                      New 12K μop “execution trace cache” mechanism.
                                      SSE2 extensions have double-precision
                                      floating-point and 128-bit packed integer data
                                      types.





Table 5.1. Partial history and features of the Intel IA-32 family of processors. Many variants 
of a given processor exist, and new features appear over time (e.g., the original Celeron had no 
cache). Cache comparisons are complicated by the different access speeds and mechanisms (e.g., 
newer Pentium IIs use an advanced transfer cache with smaller level 1 and level 2 cache sizes). 


To illustrate the considerations involved in evaluating strategies for multiplication, 
we briefly examine the case for the Intel Pentium family of processors, some of which 
appear in Table 5.1. The Pentium is essentially a 32-bit architecture, and said to be 
“superscalar” as it can process instructions in parallel. The pipelining capability is eas- 
iest to describe for the original Pentium, where there were two general-purpose integer 
pipelines, and optimization focused on organizing code to keep both pipes filled subject 
to certain pipelining constraints. The case is more complicated in the newer processors 
of the Pentium family, which use more sophisticated pipelining and techniques such as
out-of-order execution. For the discussion presented here, only fairly general properties
of the processor are involved. 

The Pentium possesses an integer multiplier that can perform a 32 x 32-bit multipli- 
cation (giving a 64-bit result). However, there are only eight (mostly) general-purpose 
registers, and the multiplication of interest is restrictive in the registers used. Of fun- 
damental interest are instruction latency and throughput, some of which are given in 
Table 5.2. Roughly speaking, latency is the number of clock cycles required before the 
result of an operation may be used, and throughput is the number of cycles that must 
pass before the instruction may be executed again.¹ Note that small latency and small
throughput are desirable under these definitions. 


Instruction                    Pentium II/III   Pentium 4
-----------------------------  --------------   -------------
Integer add, xor, ...                           0.5 / 0.5
Integer add, sub with carry                     6-8 / 2-3
Integer multiplication         4 / 1            14-18 / 3-5
Floating-point multiply                         7 / 2
MMX ALU                                         2 / 2
MMX multiply                                    8 / 2





Table 5.2. Instruction latency / throughput for the Intel Pentium II/III vs the Pentium 4. 


Fragment (5.1) has two performance bottlenecks: the dependencies between instruc- 
tions work against pipelining, and there is a significant latency period after the multiply 
(especially on the Pentium 4). Strategies for improving field multiplication (e.g., by 
reducing simultaneously) using general-purpose registers are constrained by the very 
few such registers available, carry handling, and the restriction to fixed output regis- 
ters for the multiplication of interest. Some useful memory move instructions can be 
efficiently inserted into (5.1). On the Pentium II/III, it appears that no reorganization
of the code can make better use of the latency period after the multiply, and multipli- 
cation of t-word integers requires an average of approximately seven cycles to process 
each 32 x 32 multiplication. Code similar to fragment (5.1) will do much worse on the 
Pentium 4. 


1 Intel defines latency as the number of clock cycles that are required for the execution core to complete
all of the μops that form an IA-32 instruction, and throughput as the number of clock cycles required to
wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the
throughput of an instruction can be significantly less than its latency.


Redundant representations The cost of carry handling can be reduced in some cases
by use of a different field representation. The basic idea is to choose $W' < W$ and
represent elements as $a = \sum a_i 2^{W'i}$ where $|a_i|$ may be somewhat larger than $2^{W'-1}$
(and hence such representations are not unique, and more words may be required to
represent a field element). Additions, for example, may be done without any processing
of carry. For field multiplication, choosing $W'$ so that several terms $a_i b_j$ in $c = ab$
may be accumulated without carrying into a third word may be desirable. Roughly
speaking, this is the strategy discussed in the next section, where $W'$ is such that the
approximately $2W'$-bit quantity $a_i b_j$ can be stored in a single (wide) floating-point
register.
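
A small Python sketch may help make the carry-free addition concrete. The choice $W' = 28$ and the helper names (`to_limbs`, `limbs_value`, `add_lazy`, `sub_lazy`) are assumptions made for illustration only; the point is that limbs are allowed to grow (or go negative) and carries are resolved only when a canonical value is needed.

```python
W_PRIME = 28                       # W' < W = 32 (illustrative choice)
LIMB_MASK = (1 << W_PRIME) - 1

def to_limbs(n, count):
    """Canonical radix-2**W' limbs of a nonnegative integer."""
    return [(n >> (W_PRIME * i)) & LIMB_MASK for i in range(count)]

def limbs_value(limbs):
    """Integer represented by limbs that may be oversized or negative."""
    return sum(x * (1 << (W_PRIME * i)) for i, x in enumerate(limbs))

def add_lazy(a, b):
    """Limb-wise addition with no carry propagation."""
    return [x + y for x, y in zip(a, b)]

def sub_lazy(a, b):
    """Limb-wise subtraction; limbs may become negative."""
    return [x - y for x, y in zip(a, b)]

# Sixteen canonical operands can be summed before any limb exceeds 32 bits.
x, y = 2**224 - 1, 2**200 + 12345
total = to_limbs(x, 8)
for _ in range(15):
    total = add_lazy(total, to_limbs(y, 8))
```

The representation is not unique: `total` above and the canonical limbs of the same integer differ, yet `limbs_value` returns the same result for both.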


5.1.2 Floating-point arithmetic 


The floating-point hardware present on many workstations can be used to perform integer
arithmetic. The basic techniques are not new, although the performance benefits on
common hardware have perhaps not been fully appreciated. As in the preceding section,
the examples will be drawn primarily from the Intel Pentium family; however, much of 
the discussion applies to other platforms. 

A rational number of the form $2^e m$ where $e$ and $m$ are integers with $|m| < 2^b$ is said to
be a $b$-bit floating-point number. Given a real number $z$, $\mathrm{fp}_b(z)$ denotes a $b$-bit
floating-point value close to $z$ in the sense that $|z - \mathrm{fp}_b(z)| \le 2^{e-1}$ if $|z| \le 2^{e+b}$.
A $b$-bit floating-point value $2^e m$ is the desired approximation for
$|z| \in (2^{b+e-1} - 2^{e-2},\ 2^{b+e} - 2^{e-1})$; a
simple example in the case $b = 3$ appears in the following table.


 e    z-interval              3-bit approximation           max error
--    --------------------    --------------------------    ---------
-1    [2 - 1/8, 4 - 1/4]      $2^{-1} m = 1a_1.a_0$         1/4
 0    [4 - 1/4, 8 - 1/2]      $2^{0} m = 1a_1 a_0$          1/2
 1    [8 - 1/2, 16 - 1]       $2^{1} m = 1a_1 a_0 0$        1


If $z$ is a $b$-bit floating-point value, then $z = \mathrm{fp}_b(z)$. Subject to constraints on the
exponents, floating-point hardware can find $\mathrm{fp}_b(x + y)$ and $\mathrm{fp}_b(xy)$ for $b$-bit floating-point
values $x$ and $y$, where $b$ depends on the hardware.
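
The definition can be explored with exact rational arithmetic. The sketch below is one possible model of $\mathrm{fp}_b$, assuming round-to-nearest with ties resolved to an even $m$ and ignoring exponent limits; the function name `fp` is hypothetical.

```python
from fractions import Fraction

def fp(b, z):
    """Nearest b-bit floating-point value 2**e * m to z (ties to even m)."""
    z = Fraction(z)
    if z == 0:
        return z
    sign = 1 if z > 0 else -1
    mag = abs(z)
    # Find n with 2**n <= mag < 2**(n + 1), using exact integer bit lengths.
    n = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** n > mag:
        n -= 1
    e = n - (b - 1)                       # scale so that 2**(b-1) <= mag / 2**e < 2**b
    m = round(mag / Fraction(2) ** e)     # round() on a Fraction rounds half to even
    return sign * m * Fraction(2) ** e
```

With $b = 3$, `fp(3, 13)` lies in the row $e = 1$ of the table and returns 12 (the tie 6.5 rounds to the even value 6), an error of 1. With $b = 53$ the function agrees with the rounding Python itself performs when dividing integers, for example `fp(53, Fraction(1, 3)) == Fraction(1 / 3)`.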

IEEE single- and double-precision floating-point formats consist of a sign bit $s$,
biased exponent $e$, and fraction $f$. A double-precision floating-point format

    s | e (11-bit exponent) | f (52-bit fraction)
    63  62               52   51                0

represents numbers $z = (-1)^s \times 2^{e-1023} \times 1.f$; the normalization of the significand
$1.f$ increases the effective precision to 53 bits.² Floating-point operations are sometimes
described using the length of the significand, such as 53-bit for double precision.
The Pentium has eight floating-point registers, where the length of the significand is
selected in a control register. In terms of $\mathrm{fp}_b$, the Pentium has versions for $b \in \{24, 53, 64\}$
(corresponding to formats of size 32, 64, and 80 bits).
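
The field layout can be checked directly from Python, whose `float` is an IEEE double on common platforms. The helper names `decode_double` and `double_value` below are hypothetical, and `double_value` handles normalized numbers only.

```python
import struct
from fractions import Fraction

def decode_double(x):
    """Return (s, e, f): sign bit, 11-bit biased exponent, 52-bit fraction."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    return s, e, f

def double_value(s, e, f):
    """Exact value (-1)**s * 2**(e - 1023) * 1.f of a normalized double."""
    return (-1) ** s * Fraction(2) ** (e - 1023) * (1 + Fraction(f, 2**52))
```

For example, `decode_double(-2.5)` gives `(1, 1024, 2**50)`, since $-2.5 = -1.01_2 \times 2^1$.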

2 A similar normalization occurs for 32-bit single-precision and the 80-bit double extended-precision
formats; however, the entire 64-bit significand is retained in extended-precision format.


Coding with floating-point operations requires strategies that are not merely direct
translations of what is done in the classical case. The numbers are stored in different
formats, and it is not economical to repeatedly move between the formats. Bit operations
that are convenient in integer format (e.g., extraction of specific bits) are generally
clumsy (and slow) if attempted on values in floating-point registers. On the other hand,
floating-point addition operates on more bits than addition with integer instructions if
$W = 32$, and the extra registers are welcomed on register-poor machines such as the
Pentium. Multiplication latency is still a factor (in fact, it is worse on the Pentium II/III
than for integer multiplication; see Table 5.2); however, there are more registers and
the requirement for specific register use is no longer present, making it possible to do
useful operations during the latency period.

A multiprecision integer multiplication can be performed by a combination of 
floating-point and integer operations. If the input and output are in canonical (multi- 
word integer) format, the method is not effective on Intel P6-family processors; 
however, the longer latencies of the Pentium 4 encourage a somewhat similar strat- 
egy using SIMD capabilities (§5.1.3), and the combination has been used on SPARC 
processors. 


Example 5.1 (SPARC multiplication) The SPARC (Scalable Processor ARChitecture) 
specification is the basis for RISC (Reduced Instruction Set Computer) designs from 
Sun Microsystems. Unlike the Pentium where an integer multiply instruction is avail- 
able, the 32-bit SPARC-V7 processors had only a “multiply step” instruction MULScc, 
and multiplication is essentially shift-and-add with up to 32 repeated steps. 

The SPARC-V9 architecture extends V8 to include 64-bit address and data types, ad- 
ditional registers and instructions, improved processing of conditionals and branching, 
and advanced support for superscalar designs. In particular, the V7 and V8 multiply operations
are deprecated in favour of a new MULX instruction that produces the lower 64
bits of the product of 64-bit integers. In the Sun UltraSPARC, MULX is relatively slow 
for generic 32 x 32 multiplication; worse, the instruction does not cooperate with the 
superscalar design which can issue four instructions per cycle (subject to moderately 
restrictive constraints). 

Due to the limitations of MULX, the multiprecision library GNU MP (see Appendix 
C) implements integer multiplication using floating-point registers on V9-compatible 
processors. Multiplication of a with 64-bit b splits a into 32-bit half-words and b into 
four 16-bit pieces, and eight floating-point multiplications are performed for each 64- 
bit word of a. Pairs (four per word of a) of 48-bit partial products are summed using 
floating-point addition; the remaining operations are performed after transfer to integer 
form. On an UltraSPARC I or II, the 56 instructions in the main loop of the calculation 
for ab (where 64-bits of a are processed per iteration and b is 64-bit) are said to execute 
in 14 cycles (4 instructions per cycle). 
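
The splitting described in Example 5.1 can be modelled in Python, whose floats are 53-bit doubles. The sketch below is not GNU MP's code: the function name `mul_by_word_fp` and the way partial products are grouped by weight are assumptions made for illustration. It shows why the scheme is exact: each product of a 32-bit half-word and a 16-bit piece is below $2^{48}$, and at most two products share a weight, so every floating-point sum stays below $2^{49}$.

```python
def mul_by_word_fp(a_words, b):
    """Multiply little-endian 64-bit words a_words by a 64-bit b via doubles."""
    halves = []
    for w in a_words:
        halves += [w & 0xFFFFFFFF, w >> 32]          # 32-bit half-words of a
    pieces = [(b >> (16 * j)) & 0xFFFF for j in range(4)]  # 16-bit pieces of b

    sums = {}   # bit weight -> floating-point sum of at most two products
    for m, h in enumerate(halves):
        for j, q in enumerate(pieces):
            prod = float(h) * float(q)               # < 2**48, exact in a double
            shift = 32 * m + 16 * j
            sums[shift] = sums.get(shift, 0.0) + prod  # < 2**49, still exact

    # Remaining work is done after transfer to integer form.
    return sum(int(s) << shift for shift, s in sums.items())
```

Eight floating-point multiplications are performed per 64-bit word of `a_words`, matching the count in the example.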


The conversions between integer and floating-point formats on each field multipli- 
cation allow floating-point variations to be inserted relatively painlessly into existing 
code. However, more efficient curve arithmetic may be constructed if the number of 
conversions can be minimized across curve operations.

Scalar multiplication for P-224


We outline a fast method due to Bernstein for performing elliptic curve point multiplication
$kP$ using floating-point hardware for the NIST-recommended curve over $\mathbb{F}_p$
with $p = 2^{224} - 2^{96} + 1$. All of the performance improvements are in field arithmetic
(and in the organization of field operations in point doubling and addition). On the
Pentium, which can use a 64-bit significand, field elements were represented as

$$a = \sum_{i=0}^{7} a_i 2^{28i}$$

where $|a_i|$ is permitted to be somewhat larger than $2^{27}$ (as outlined at the end of §5.1.1).
In comparison with the representation as a vector of 32-bit positive integers, this
representation is not unique, and an additional word is required. Field multiplication will
require more than 64 (floating-point) multiplications, compared with 49 in the classical
method. On the positive side, more registers are available, multiplication can occur on
any register, and terms $a_i b_j$ may be directly accumulated in a register without any carry
handling.


Field arithmetic Field multiplication and (partial) reduction is performed simultaneously,
calculating $c = ab$ from most-significant to least-significant output word.
Portions of the code for computing the term $c_k$ are of the form

$$
\begin{aligned}
r_2 &\leftarrow \sum_{i+j=k} a_i b_j \\
r_1 &\leftarrow \mathrm{fp}_{64}(r_2 + \alpha_k) - \alpha_k \\
r_0 &\leftarrow r_2 - r_1
\end{aligned}
$$

where $r_i$ are floating-point registers and $\alpha_k = 3 \cdot 2^{90+28k}$. Roughly speaking, the
addition and subtraction of $\alpha_k$ is an efficient way to extract bits from a floating-point
number. Consider the case $k = 0$ and that rounding is via round-to-nearest. If $r_2$ is a
64-bit floating-point value with $|r_2| < 2^{90}$, then $r_1 \in 2^{28}\mathbb{Z}$, $|r_0| \le 2^{27}$, and $r_2 = r_0 + r_1$.
Figure 5.1 shows the values for $r_1$ and $r_0$ when $0 \le r_2 = v \cdot 2^{28} + u < 2^{90}$ and the case
$u = 2^{27}$ is handled by a “round-to-even” convention.
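[Figure 5.1: bit-layout diagram (bit positions 91, 89, 28, 27, 0; 64-bit significand) for the cases
$u < 2^{27}$ and $u = 2^{27}$ with $v$ even. Panel (a): $\alpha + r_2 = \alpha + v \cdot 2^{28} + u$.
Panel (b): $r_1 = \mathrm{fp}_{64}(r_2 + \alpha) - \alpha$, $r_0 = r_2 - r_1$.]


Figure 5.1. Splitting of a 64-bit floating-point number $r_2$ for the case $0 \le r_2 = v \cdot 2^{28} + u < 2^{90}$
and $\alpha = 3 \cdot 2^{90}$. The round-to-nearest convention is used, with round-to-even when $u = 2^{27}$.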
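
The same splitting trick can be tried with ordinary 53-bit doubles, for which the analogous constant is $\alpha = 3 \cdot 2^{51+s}$ when splitting at bit $s$ (an adaptation for illustration; the method described above uses a 64-bit significand and $\alpha_k = 3 \cdot 2^{90+28k}$). The sketch assumes the platform evaluates double-precision addition with round-to-nearest-even, as is the default on common hardware; the function name `split` is hypothetical.

```python
def split(r2, s):
    """Split a double r2 with |r2| < 2**(51 + s) as r2 = r0 + r1, where
    r1 is a multiple of 2**s and |r0| <= 2**(s - 1)."""
    alpha = 3.0 * 2.0 ** (51 + s)     # values near alpha are spaced 2**s apart
    r1 = (r2 + alpha) - alpha         # the addition rounds r2 to a multiple of 2**s
    r0 = r2 - r1                      # exact remainder
    return r0, r1
```

With $s = 28$ and $r_2 = 5 \cdot 2^{28} + 123$, the call returns $r_0 = 123$ and $r_1 = 5 \cdot 2^{28}$. In the tie case $u = 2^{27}$, an even $v$ is kept ($r_0 = 2^{27}$), while an odd $v$ is rounded up to $v + 1$ ($r_0 = -2^{27}$), which is the round-to-even behaviour shown in Figure 5.1.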
The $r_0$ and $r_1$ calculated for a given $k > 7$ are folded into lower-order terms. The first
step finds $r_2 = a_7 b_7$, and then $c_{14} = r_0$ and $c_{15} = r_1$. Let $c'_k = c_k \cdot 2^{-28k}$ and consider

$$
\begin{aligned}
c_k = c'_k 2^{28k} &= c'_k 2^{28(k-8)} 2^{224} \\
&= c'_k 2^{28(k-8)} (p + 2^{96} - 1) \\
&\equiv c'_k 2^{28(k-8)} 2^{96} - c'_k 2^{28(k-8)} \pmod{p}.
\end{aligned}
$$

Since $c'_k 2^{28(k-8)} = c_k \cdot 2^{-224}$, this says that $c_k \cdot 2^{-128}$ is added to $c_{k-5}$, and
$c_k \cdot 2^{-224}$ is subtracted from $c_{k-8}$; for
example, $c_{15}$ is folded into $c_{10}$ and $c_7$. The process eventually produces a partially
reduced product $c = ab$ as a vector of eight floating-point values.
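
The folding identity can be checked with integer limbs. The sketch below is an integer model, not the floating-point code described above: it holds the scaled coefficients $c'_k$ in a list, and the names `P224`, `limbs_value`, and `fold_high_terms` are hypothetical. In this model, adding $c_k \cdot 2^{-128}$ at position $k - 5$ means adding $c'_k \cdot 2^{12}$ there, since $28k - 128 = 28(k-5) + 12$.

```python
P224 = 2**224 - 2**96 + 1

def limbs_value(limbs):
    """Integer sum of c'_k * 2**(28 k) over a list of (possibly signed) limbs."""
    return sum(c * (1 << (28 * k)) for k, c in enumerate(limbs))

def fold_high_terms(limbs):
    """Fold limbs 8..15 into limbs 0..7 using 2**224 = 2**96 - 1 (mod p)."""
    c = list(limbs)
    for k in range(15, 7, -1):        # most significant term first
        hi, c[k] = c[k], 0
        c[k - 5] += hi << 12          # contribution c_k * 2**-128
        c[k - 8] -= hi                # contribution -c_k * 2**-224
    return c[:8]
```

The result is only partially reduced: the eight limbs may be oversized or negative, but they represent the same residue modulo $p$.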


Curve arithmetic Bernstein’s point multiplication method for computing $kP$ uses a
width-4 window method (without sliding), with an expected $3 + (15/16)(224/4)$ point
additions.³ On the Pentium II/III, point multiplication required roughly 730,000 cycles,
significantly faster than other implementations reported in the literature. Most of
the improvement may be obtained by scheduling only field multiplication and squaring.
However, the point arithmetic was organized so that some operations could be
efficiently folded into field multiplication; for example, the field arithmetic for point
doubling $(x_2, y_2, z_2) = 2(x_1, y_1, z_1)$ is organized as

$$
\begin{aligned}
&\delta \leftarrow z_1^2, \quad \gamma \leftarrow y_1^2, \quad \beta \leftarrow x_1 \gamma, \quad \alpha \leftarrow 3(x_1 - \delta)(x_1 + \delta), \\
&x_2 \leftarrow \alpha^2 - 8\beta, \quad z_2 \leftarrow (y_1 + z_1)^2 - \gamma - \delta, \quad y_2 \leftarrow \alpha(4\beta - x_2) - 8\gamma^2
\end{aligned}
$$

requiring three multiplications, five squarings, and seven reductions. Conversion of
the output to canonical form is expensive, but is done only at the end of the point
multiplication.
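
The doubling formulas can be sanity-checked against the usual affine doubling rule for a curve $y^2 = x^3 - 3x + b$, assuming the standard Jacobian correspondence $x = X/Z^2$, $y = Y/Z^3$. The sketch below uses plain modular integer arithmetic rather than the floating-point field code described above, and all function names are hypothetical. Note that $b$ does not appear in the doubling formulas, so any pair $(x, y)$ with $y \ne 0$ can serve as a test point. The three-argument `pow` with exponent $-1$ requires Python 3.8 or later.

```python
P224 = 2**224 - 2**96 + 1

def double_jacobian(x1, y1, z1, p):
    """Point doubling in Jacobian coordinates for a curve with a = -3."""
    delta = z1 * z1 % p
    gamma = y1 * y1 % p
    beta = x1 * gamma % p
    alpha = 3 * (x1 - delta) * (x1 + delta) % p
    x2 = (alpha * alpha - 8 * beta) % p
    z2 = ((y1 + z1) ** 2 - gamma - delta) % p
    y2 = (alpha * (4 * beta - x2) - 8 * gamma * gamma) % p
    return x2, y2, z2

def to_affine(X, Y, Z, p):
    """Convert Jacobian (X, Y, Z) to affine (X / Z**2, Y / Z**3)."""
    zi = pow(Z, -1, p)
    return X * zi * zi % p, Y * zi**3 % p

def double_affine(x, y, p):
    """Affine doubling with slope (3 x**2 - 3) / (2 y)."""
    lam = (3 * x * x - 3) * pow(2 * y, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    y3 = (lam * (x - x3) - y) % p
    return x3, y3
```

The five squarings are `z1 * z1`, `y1 * y1`, `alpha * alpha`, `(y1 + z1) ** 2`, and `gamma * gamma`; the three general multiplications are `x1 * gamma`, the product forming `alpha`, and `alpha * (4 * beta - x2)`.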


Programming considerations Except for a fragment to set the floating-point control 
register, all of the code is in C. However, the scheduling and management of regis- 
ters is processor-specific, and involves some of the same work necessary for assembly 
language versions. There are also a number of requirements on the development tools. 
It is essential that 80-bit extended-double registers not be unexpectedly spilled to 64- 
bit doubles by the compiler. Typically, data must be aligned properly (e.g., on 8-byte 
boundaries), and some environments do not manage this properly. Alignment for au- 
tomatic variables may require extra steps. An alternate strategy using SIMD integer 
capabilities is discussed in §5.1.3. 


3 The reference implementation processes $k$ as $k = \sum_i k_i 2^{4i}$ where $-8 \le k_i < 8$. The precomputation
phase stores $iP$ in Chudnovsky coordinates $(X : Y : Z : Z^2 : Z^3)$ for nonzero $i \in [-8, 8)$, requiring three point
additions and three point doublings. The excessive storage is not essential for performance.

5.1.3 SIMD and field arithmetic


Single-instruction multiple-data (SIMD) capabilities perform operations in parallel on 
vectors. In the Intel Pentium family (see Table 5.1), such hardware is present on all 
but the original Pentium and the Pentium Pro. The features were initially known as 
“MMX Technology” for the multimedia applications, and consisted of eight 64-bit 
registers, operating on vectors with components of 1, 2, 4, or 8 bytes. The capabilities 
were extended in subsequent processors: streaming SIMD (SSE) in the Pentium III 
has 128-bit registers and single-precision floating-point arithmetic, and SSE2 extends 
SSE to include double-precision floating-point and integer operations in the Pentium 
4. Advanced Micro Devices (AMD) introduced MMX support on their K6 processor, 
and added various extensions in newer chips. 

In this section, we consider the use of SIMD capabilities on AMD and Intel proces- 
sors to speed field arithmetic. The general idea is to use these special-purpose registers 
to implement fast 64-bit operations on what is primarily a 32-bit machine. For binary 
fields, the common MMX subset can be used to speed multiplication and inversion.
For prime fields, the SSE2 extensions (specific to the Pentium 4) provide an alternative 
approach to the floating-point methods of §5.1.2. 


Binary field arithmetic with MMX 


The eight 64-bit MMX registers found on Pentium and AMD processors are relatively 
easy to employ to speed operations in binary fields $\mathbb{F}_{2^m}$. Although restrictive in the
functions supported, the essential shift and xor operations required for binary field 
arithmetic are available. The strengths and shortcomings of the MMX subset for field 
multiplication and inversion are examined in this section. 

Naively, the 64-bit registers should improve performance by a factor of 2 compared 
with code using only general-purpose 32-bit registers. In practice, the results depend on 
the algorithm and the coding method. Implementations may be a mix of conventional 
and MMX code, and only a portion of the algorithm benefits from the wide registers. 
Comparison operations produce a mask vector rather than setting status flags, and data- 
dependent branching is not directly supported. The MMX registers cannot be used 
to address memory. On the other hand, the Pentium has only eight general-purpose 
registers, so effective use of the extra registers may contribute collateral benefits to 
general register management. As noted in Table 5.2, there is no latency or throughput 
penalty for use of MMX on the Pentium II/III; on the Pentium 4, scheduling will be of
more concern. 


Field multiplication Comb multiplication (Algorithm 2.36) with reduction is efficiently
implemented with MMX. Consider the field $\mathbb{F}_{2^{163}}$, with reduction polynomial
$f(z) = z^{163} + z^7 + z^6 + z^3 + 1$. The precomputation step 1 uses MMX, and the accumulator
$C$ is maintained in six MMX registers; processing of the input $a$ is accomplished
with general-purpose registers. The algorithm adapts well to use of the wide registers,
since the operations required are simple xor and shifts, there are no comparisons on
MMX registers, and (for this case) the accumulator $C$ can be maintained entirely in
registers. Field multiplication is roughly twice the speed of a traditional approach.
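
To show why only shifts and xors are needed, here is a Python sketch of a left-to-right comb multiplication with 4-bit windows, in the style we understand Algorithm 2.36 to have. It is a model only: Python integers hold the polynomials (bit $i$ is the coefficient of $z^i$), the reduction is a simple bit-by-bit loop rather than a fast word-level one, and the names `comb_mul`, `naive_mul`, and `reduce_mod_f` are hypothetical.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # z^163 + z^7 + z^6 + z^3 + 1
WORD, WIN = 32, 4            # word size W and window width w
T = 6                        # ceil(163 / 32) words per field element

def reduce_mod_f(c):
    """Reduce a binary polynomial modulo f(z), one bit at a time."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F << (i - M)
    return c

def comb_mul(a, b):
    """Windowed left-to-right comb multiplication in F_{2^163}."""
    # Precompute table[u] = u(z) * b(z) for every polynomial u of degree < WIN.
    table = [0] * (1 << WIN)
    for u in range(1, 1 << WIN):
        table[u] = (table[u >> 1] << 1) ^ (b if u & 1 else 0)
    c = 0
    for k in range(WORD // WIN - 1, -1, -1):      # outer loop over windows
        for j in range(T):                        # inner loop over words of a
            word = (a >> (WORD * j)) & 0xFFFFFFFF
            u = (word >> (WIN * k)) & ((1 << WIN) - 1)
            c ^= table[u] << (WORD * j)           # add B_u at word offset j
        if k != 0:
            c <<= WIN                             # C <- C * z^w
    return reduce_mod_f(c)

def naive_mul(a, b):
    """Bit-serial shift-and-xor multiplication, used here only as a cross-check."""
    c = 0
    while a:
        if a & 1:
            c ^= b
        a >>= 1
        b <<= 1
    return reduce_mod_f(c)
```

The unreduced accumulator has degree at most 324, which is why six 64-bit registers suffice to hold it.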


Field inversion For inversion, Algorithm 2.48 (a Euclidean Algorithm variant) was 
implemented. In contrast to multiplication, the inversion algorithm requires some op- 
erations which are less-efficiently implemented with MMX. A degree calculation is 
required in step 3.1, and step 3.3 requires an extra register load since the shift is by a 
non-constant value. Two strategies were tested. The first used MMX only on $g_1$ and $g_2$,
applying conventional code to track the lengths of $u$ and $v$ and find degrees. The second
strategy obtained somewhat better performance by using MMX for all four variables.
Lengths of $u$ and $v$ were tracked in 32-bit increments, in order to more efficiently
perform degree calculations (by extracting appropriate 32-bit halves and passing to 
conventional code for degree). A factor 1.5 improvement was observed in comparison 
with a non-MMX version. 


Programming considerations Unlike the commitment required for use of floating- 
point registers as described in §5.1.2, the use of MMX capabilities may be efficiently 
isolated to specific routines such as field multiplication—other code in an elliptic 
curve scheme could remain unchanged if desired. Implementation in C may be done 
with assembly-language fragments or with intrinsics. Assembly-language coding al- 
lows the most control over register allocation and scheduling, and was the method 
used to implement Algorithm 2.36. Programming with intrinsics is somewhat similar 
to assembly-language coding, but the compiler manages register allocation and can 
perform optimizations. The inversion routines were coded with intrinsics. 

Intel provides intrinsics with its compiler, and the features were added to gcc-3.1. 
As in §5.1.2, data alignment on 8-byte boundaries is required for performance. The
MMX and floating-point registers share the same address space, and there is a penalty
for switching from MMX operations to floating-point operations. Code targeted for 
the Pentium 4 could use the SSE2 enhancements, which do not have the interaction 
problem with the floating-point stack, and which have wider 128-bit vector operations. 


SIMD and prime field arithmetic 


The Pentium III has eight 128-bit SIMD registers, and SSE2 extensions on the Pentium 
4 support operations on vectors of double-precision floating-point values and 64-bit 
integers. In contrast to the floating-point implementation described in §5.1.2, use of 
the integer SSE2 capabilities can be efficiently isolated to specific routines such as 
field multiplication. 

Multiplication in SSE2 hardware does not increase the maximum size of operands
over conventional instructions (32 bits in both cases, giving a 64-bit result); however,
there are more registers which can participate in multiplication, the multiplication
latency is lower, and products may be accumulated with 64-bit operations. With
conventional code, handling carry is a bottleneck but is directly supported since arithmetic
operations set condition codes that can be conveniently used. The SSE2 registers are
not designed for this type of coding, and explicit tests for carry are expensive.
Implementing the operand-scanning multiplication of Algorithm 2.9 is straightforward with
scalar SSE2 operations, since the additions may be done without concern for carry. The
approach has two additions and a subsequent shift associated with each multiplication
in the inner product operation $(UV) \leftarrow C[i+j] + A[i] \cdot B[j] + U$. The total number
of additions and shifts can be reduced by adapting the product-scanning approach in
Algorithm 2.10 at the cost of more multiplications. To avoid tests for carry, one or both
of the input values are represented in the form $a = \sum a_i 2^{W'i}$ where $W' < 32$ so that
products may be accumulated in 64-bit registers.


Example 5.2 (multiplication with SSE2 integer operations) Suppose inputs consist of 
integers represented as seven 32-bit words (e.g., in P-224 discussed in §5.1.2). A scalar 
implementation of Algorithm 2.9 performs 49 multiplications, 84 additions, and 49 
shifts in the SSE2 registers. If the input is split into 28-bit fragments, then Algorithm 
2.10 performs 64 multiplications, 63 additions, and 15 shifts to obtain the product as 
16 28-bit fragments. 

The multiprecision library GNU MP (see Appendix C) uses an operand-scanning 
approach, with an 11-instruction inner loop. The code is impressively compact, and 
generic in that it handles inputs of varying lengths. If the supplied testing harness is 
used with parameters favourable to multiplication times, then timings are comparable 
to those obtained using more complicated code. However, under more realistic tests, a 
product-scanning method using code specialized to the 7-word case is 20% faster, even 
though the input must be split into 28-bit fragments and the output reassembled into 
32-bit words. A straightforward SSE2 integer implementation of multiplication on 7- 
word inputs and producing 14-word output (32-bit words) requires approximately 325 
cycles, less than half the time of a traditional approach (which is especially slow on the 
Pentium 4 due to the instruction latencies in Table 5.2). 
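
The 28-bit variant in Example 5.2 can be modelled in Python to confirm that no carry tests are needed. The sketch below is an assumption-laden model, not SSE2 code: `split28` and `mul_frag` are hypothetical names, and the value `peak` simply records the largest accumulator seen so that the 64-bit bound can be checked.

```python
FRAG_BITS = 28
FRAG_MASK = (1 << FRAG_BITS) - 1

def split28(n, count=8):
    """Split a nonnegative integer into 28-bit fragments (little-endian)."""
    return [(n >> (FRAG_BITS * i)) & FRAG_MASK for i in range(count)]

def mul_frag(a, b):
    """Product scanning on 28-bit fragments with a single wide accumulator.

    Returns the 2n output fragments and the largest accumulator value seen.
    """
    n = len(a)
    out, acc, peak = [], 0, 0
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]        # accumulate without any carry test
        peak = max(peak, acc)
        out.append(acc & FRAG_MASK)       # emit one 28-bit output fragment
        acc >>= FRAG_BITS                 # one shift per column
    out.append(acc)
    return out, peak
```

With eight fragments per operand, the loop performs 64 multiplications and 15 shifts. Each column sums at most eight products below $2^{56}$ plus a small carry, so the accumulator stays below $2^{60}$ and fits comfortably in a 64-bit register.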


5.1.4 Platform miscellany 


This section presents selected notes on optimization techniques and platform charac- 
teristics, some of which are specific to development environments the authors have 
used. Compiler-specific notes are restricted to those for the C programming language, 
a common choice when a higher-level language is used. Even if implementation in 
hand-crafted assembly (for performance) is planned, prototyping in a higher-level lan- 
guage may speed development and comparisons of algorithms. In this case, it will be 
desirable that the prototype provide meaningful benchmark and other information for 
performance estimates of assembly-language versions. 


Common optimization considerations 


We present basic performance considerations and techniques with wide applicability.

Loop unrolling Among common strategies for improving performance, loop unrolling
is one of the most basic and most profitable. Loops are expanded so that more
operations are done per iteration, reducing the number of instructions executed but 
increasing code size. The longer sequences are generally better-optimized, especially 
in the case of fully-unrolled loops. As an example, the comb multiplication in Algo- 
rithm 2.36 can be done efficiently with an outer loop over the w-bit windows, and a 
completely unrolled inner loop to perform addition and shifting. 

Typically, user-specified options influence the amount of loop unrolling performed 
by the compiler. At the current state of compiler technology, this automatic method 
cannot replace programmer-directed efforts, especially when unrolling is combined 
with coding changes that reduce data-dependencies. 


Local data On register-poor machines such as the Intel Pentium, the consumption of 
registers to address data can frustrate optimization efforts. Copying data to the stack 
allows addressing to be done with the same register used for other local variables. Note 
that the use of a common base register can result in longer instructions (on processors 
such as the Pentium with variable-length instructions) as displacements increase. 


Duplicated code For some algorithms, duplicating code or writing case-specific frag- 
ments is effective. As an example, the Euclidean algorithm variants for inversion call 
for repeated interchange of the contents of arrays holding field elements. This can 
be managed by copying contents or interchanging pointers to the arrays; however, 
faster performance may be obtained with separate code fragments which are essentially 
identical except that the names of variables are interchanged. 

Similarly, case-specific code fragments can be effective at reducing the number of 
conditionals and other operations. The Euclidean algorithm variants, for example, have 
arrays which are known a priori to grow or shrink during execution. If the lengths can 
be tracked efficiently, then distinct code fragments can be written, and a transfer to the 
appropriate fragment is performed whenever a length crosses a boundary. A somewhat 
extreme case of this occurs with the Almost Inverse Algorithm 2.50, where two of the 
variables grow and two shrink. If t words are used to represent a field element, then 
$t^2$ length-specific fragments can be employed. In tests on the Intel Pentium and Sun
SPARC, this was in fact required for the algorithm to be competitive with Algorithm 
2.48. 

Use of “bail-out” strategies can be especially effective with code duplication. The 
basic idea is to remove code which handles unlikely or contrived data, and transfer 
execution to a different routine if such data is encountered. Such methods can have dis- 
mal worst-case performance, but may optimize significantly better (in part, because less 
code is required). The technique is effective in the Euclidean Algorithm 2.48, where 
the “unlikely data” is that giving large shifts at step 3.3. 

Duplicated and case-specific coding can involve significant code expansion. Platform
characteristics and application constraints may limit the use of such strategies.

Branch misprediction Conditional expressions can significantly degrade optimizations
and performance, especially if the outcome is poorly-predicted. Branch prediction
in the Intel Pentium family, for example, was improved in the P6 processors, but the 
cost of misprediction is in fact much higher. Care must be exercised when timing rou- 
tines containing a significant number of conditional expressions. Timing by repeated 
calls with the same data can give wildly misleading results if typical usage differs. This 
is easily seen in OEF arithmetic if implemented in the natural way suggested by the 
mathematics, and in the routine described in §3.6.2 for solving x?+x=cin binary 
fields, since branch prediction will be very poor with realistic data. 

Techniques to reduce the number of frequently executed, poorly predicted conditionals
include algorithm changes, table-lookup, and specialized instructions. In the case
of OEF multiplication in §2.4.2, the natural method which performs many subfield operations
is replaced by an algorithm with fewer conditional expressions. Table-lookup
is a widely used method, which is effective if the size of the table is manageable.
(Table-lookup can eliminate code, so the combined code and table may require less
storage than the non-table version.) The method is effective in Algorithm 3.86 for
solving $x^2 + x = c$, eliminating conditionals at step 3 and processing multiple bits
concurrently. Finally, the specialized instructions are illustrated by the Pentium II or
later, which contain conditional move and other instructions eliminating branching at
the cost of some dependency between instructions.
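
As a small illustration of removing a data-dependent branch, the sketch below computes a modular sum with and without a conditional. It is a minimal example under stated assumptions: the function names are hypothetical, the inputs satisfy a, b < p < 2^31 (as for the OEF prime p = 2^31 - 1), and whether a given compiler emits a branch, a conditional move, or mask arithmetic for either form is not guaranteed.

```c
#include <stdint.h>

/* Branching version: the outcome depends on the data. */
uint32_t mod_add_branch(uint32_t a, uint32_t b, uint32_t p)
{
    uint32_t s = a + b;      /* no overflow since a, b < p < 2^31 */
    if (s >= p)
        s -= p;
    return s;
}

/* Branch-free version: turn the comparison into a mask. */
uint32_t mod_add_mask(uint32_t a, uint32_t b, uint32_t p)
{
    uint32_t s = a + b;
    uint32_t mask = (uint32_t)0 - (uint32_t)(s >= p);  /* all ones if s >= p, else 0 */
    return s - (p & mask);
}
```

The mask version trades the branch for a short dependency chain, which mirrors the trade-off noted above for conditional move instructions.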


Assembly coding Performance considerations, shortcuts, register allocation, and ac- 
cess to platform features are often sufficiently compelling to justify coding critical 
sections in assembler. If many platforms must be supported, coding entire routines 
may involve significant effort—even within the same family of processors, different 
scheduling may be required for best performance. 

Consider the multiply-and-accumulate fragment (5.1). This is commonly coded in 
assembler for two reasons: some compilers do not process the 2W-bit product from W- 
bit input efficiently, and instructions that access the carry flag rather than explicit tests 
for carry should be used. In longer fragments, it may also be possible to outperform the 
compiler in register allocation. 
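
For comparison, a portable C version of a multiply-and-accumulate step might look like the sketch below. It is written in the spirit of fragment (5.1) but does not reproduce it: the three-word accumulator type `acc96` and the function name `mul_acc` are hypothetical, and W = 32 is assumed. The explicit comparisons stand in for the carry flag that assembly code would use directly.

```c
#include <stdint.h>

/* Three-word accumulator (r2, r1, r0); r0 is least significant. */
typedef struct { uint32_t r0, r1, r2; } acc96;

/* acc += a * b. The 64-bit type holds the 2W-bit product for W = 32. */
void mul_acc(acc96 *acc, uint32_t a, uint32_t b)
{
    uint64_t uv = (uint64_t)a * b;        /* 32x32 -> 64-bit product */
    uint32_t v = (uint32_t)uv;            /* low word */
    uint32_t u = (uint32_t)(uv >> 32);    /* high word */
    uint32_t carry;

    acc->r0 += v;
    carry = (acc->r0 < v);                /* explicit test replaces the carry flag */
    acc->r1 += carry;
    carry = (acc->r1 < carry);            /* carry out from adding the carry in */
    acc->r1 += u;
    carry += (acc->r1 < u);               /* at most one of the two tests can fire */
    acc->r2 += carry;
}
```

Each comparison is a potential conditional or extra instruction, which is one reason this fragment is a common candidate for assembly.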

Inline assembly, supported by some compilers, is especially desirable for inserting 
short fragments. As an example, the Euclidean Algorithm 2.48 requires polynomial 
degree calculations. A relatively fast method uses a binary search and table lookup, 
once the nonzero word of interest is located. Some processors have instruction sets 
from which a fast “bit scan” may be built: the Pentium has single instructions (bsr and 
bsf) for finding the position of the most or least significant bit in a word.* Similarly, 
Sun suggests using a Hamming weight (population) instruction to build a fast bit scan 
from the right for the SPARC. The GNU C and Intel compilers work well for inlining 
such code, since it is possible to direct cooperation with surrounding code. In contrast,
the Microsoft compiler has only limited support for such cooperation, and can suffer
from poor register management.

*The number of cycles required by the bit scan instructions varies across the Pentium family. The floating
point hardware can be used to provide an alternative to bit scan.
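
The degree calculation can be sketched as below. This is an illustration only: the function names are hypothetical, and `__builtin_clz` is a GNU C builtin available in current GCC and Clang releases (an assumption here; it is not the one-line inline assembly used with the compilers discussed in this section). A binary search followed by a small table lookup serves as the portable fallback.

```c
#include <stdint.h>

/* Portable fallback: binary search narrows to a 4-bit value, then a
 * 16-entry table finishes. Returns the index (0..31) of the highest
 * set bit of w; w must be nonzero. */
int msb_search(uint32_t w)
{
    static const int8_t nibble_msb[16] =
        { -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3 };
    int pos = 0;
    if (w >> 16) { w >>= 16; pos += 16; }
    if (w >> 8)  { w >>= 8;  pos += 8; }
    if (w >> 4)  { w >>= 4;  pos += 4; }
    return pos + nibble_msb[w];
}

/* Highest set bit of a nonzero word, using a compiler builtin if present. */
int msb(uint32_t w)
{
#if defined(__GNUC__)
    return 31 - __builtin_clz(w);   /* undefined for w == 0 */
#else
    return msb_search(w);
#endif
}

/* Degree of a binary polynomial held in t 32-bit words, a[0] least
 * significant. Returns -1 for the zero polynomial. */
int poly_degree(const uint32_t *a, int t)
{
    for (int i = t - 1; i >= 0; i--)
        if (a[i] != 0)
            return 32 * i + msb(a[i]);
    return -1;
}
```

For example, a six-word element whose top word is 0x8 has degree 32*5 + 3 = 163.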


Compiler characteristics and flaws 


The remaining notes in this section are decidedly platform-specific. The compilers referenced
are GNU C (gcc-2.95), Intel (6.0), Microsoft (6.0), and Sun Workshop (6U2),
producing 32-bit code for the Intel Pentium family and 32- or 64-bit code for the Sun
UltraSPARC.


Scalars vs arrays Some compilers will produce slower code when arrays are used 
rather than scalars (even though the array indices are known at compile-time). Among 
the compilers tested, GNU C exhibits this optimization weakness. 


Instruction scheduling Compared with the Intel and Sun compilers, GNU C is 
weaker at instruction scheduling on the Pentium and SPARC platforms, but can be 
coerced into producing somewhat better sequences by relatively small changes to the 
source. In particular, significantly different times were observed in tests with Algorithm 
2.36 on SPARC with minor reorganizations of code. The Sun Workshop compiler is 
less sensitive to such changes, and generally produces faster code.

On the Intel processors, scheduling and other optimizations using general-purpose 
registers are frustrated by the few such registers available. A common strategy is to 
allow the frame pointer (ebp) to be used as a general-purpose register; in GNU C, this 
is ‘-fomit-frame-pointer’. 


Alignment Processors typically have alignment requirements on data (e.g., 32-bit in- 
tegers appear on 4-byte boundaries), and unaligned accesses may fault or be slow. This 
is of particular concern with double-precision floating-point values and data for SIMD 
operations, since some environments do not manage the desired alignment properly. 
It is likely that these shortcomings will be corrected in subsequent releases of the de- 
velopment tools. Regardless, alignment for automatic (stack) variables may require 
additional steps. 
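
Two ways of obtaining aligned automatic storage are sketched below. Both are illustrations under assumptions that go beyond the text: the first relies on the C11 `alignas` specifier from `<stdalign.h>` (newer than the compilers discussed here), and the second is the older manual technique of over-allocating and rounding the address up. The function names are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of n bytes (n must be a power of two). */
int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p & (uintptr_t)(n - 1)) == 0;
}

/* C11: ask the compiler to align a stack variable to 16 bytes. */
int stack_buffer_is_aligned(void)
{
    alignas(16) uint32_t buf[4] = {0};
    return is_aligned(buf, 16);
}

/* Manual approach: the caller supplies at least 16 + 15 bytes, and the
 * first 16-byte boundary inside that storage is returned. */
void *align_up16(unsigned char *raw)
{
    return (void *)(((uintptr_t)raw + 15u) & ~(uintptr_t)15u);
}
```

A caller using the manual approach would declare `unsigned char raw[16 + 15];` and work through the pointer returned by `align_up16(raw)`.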


Flaws Despite the maturity of the compilers tested, it was relatively easy to uncover 
weaknesses. For example, an apparent optimization flaw in the Sun Workshop compiler 
was triggered by a small code change in the 64-bit implementation of Algorithm 2.36, 
causing shifts by 4 to be processed as multiplication by 16, a much slower operation 
on that platform. Workarounds include post-processing the assembler output or using 
a weaker optimization setting. 

Significant optimization problems were observed in the Microsoft compiler con- 
cerning inlining of C code; in particular, multiplication in a short OEF routine would 
sometimes be replaced by a function call. This bug results in larger and much slower 
code. The widely-used Microsoft compiler produces code which is competitive with 
that of the Intel compiler (provided no bugs are triggered). However, the limited ability
for inline assembler to cooperate with surrounding code is a design weakness compared
with that of GNU C or the Intel compilers, which have the additional advantage that 
they can be used on Unix-like systems. 


5.1.5 Timings 


Selected field operation timings are presented for Intel Pentium family processors and 
the Sun UltraSPARC, commonly used in workstations. The NIST recommended binary 
and prime fields (§A.2) are the focus, although some data for an OEF (§2.4) is presented 
for comparison. 

It is acknowledged that timings can be misleading, and are heavily influenced by 
the programmer’s talent and effort (or lack thereof), compiler selection, and the precise 
method of obtaining the data. The timings presented here should be viewed with the 
same healthy dose of skepticism prescribed for all such data. Nonetheless, timings are 
essential for algorithm analysis, since rough operation counts are often insufficient to 
capture platform characteristics. For the particular timings presented here, there has 
generally been independent “sanity check” data available from other implementations. 

Tables 5.3-5.5 give basic comparisons for the NIST recommended binary and prime 
fields, along with a selected OEF. Inversion and multiplication times for binary fields 
on two platforms appear in Table 5.6, comparing compilers, inversion algorithms, and 
32-bit vs 64-bit code. The 64-bit code on the Intel Pentium III is via special-purpose 
registers. These capabilities were extended in the Pentium 4, and Table 5.7 includes 
timings for prime field multiplication via these registers along with an approach using 
floating-point registers. 


Field arithmetic comparisons 


Timings for the smallest of the NIST recommended binary and prime fields, along
with an OEF, are presented in Table 5.3. Specifically, these are the binary field $\mathbb{F}_{2^{163}}$
with reduction polynomial $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$, the prime field $\mathbb{F}_{p_{192}}$ with
$p_{192} = 2^{192} - 2^{64} - 1$, and the OEF $\mathbb{F}_{p^6}$ with prime $p = 2^{31} - 1$ and reduction
polynomial $f(z) = z^6 - 7$. Realistic branch misprediction penalties are obtained using a
sequence of pseudo-randomly generated field elements, and the timings include framework
overhead such as function calls. The Intel compiler version 6 along with the
Netwide Assembler (NASM) were used on an Intel Pentium III running the Linux 2.2
operating system.
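
A timing loop along these lines can be sketched as follows. Everything in it is hypothetical and much simpler than a real harness: the routine under test is a stand-in (addition modulo 2^31 - 1 with a data-dependent branch), the pool size is arbitrary, and `clock()` has coarse resolution, so a large repetition count is needed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define NELEMS 1024   /* hypothetical size of the operand pool */

/* Stand-in for the routine being timed. */
static uint32_t op_under_test(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    if (s >= 0x7FFFFFFFu)     /* outcome varies with the data */
        s -= 0x7FFFFFFFu;
    return s;
}

/* Average time per call in microseconds over reps > 0 calls, cycling
 * through pseudo-random operands rather than repeating one input. */
double time_op_us(unsigned long reps, unsigned seed)
{
    static uint32_t pool[NELEMS];
    volatile uint32_t sink = 0;   /* keeps the calls from being optimized away */

    srand(seed);
    for (int i = 0; i < NELEMS; i++)
        pool[i] = (((uint32_t)rand() << 16) ^ (uint32_t)rand()) % 0x7FFFFFFFu;

    clock_t start = clock();
    for (unsigned long r = 0; r < reps; r++)
        sink ^= op_under_test(pool[r % NELEMS], pool[(r + 1) % NELEMS]);
    clock_t stop = clock();

    return 1e6 * (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Timing the same routine with a single fixed pair of operands lets the branch predictor learn the outcome, which is the effect the pseudo-random sequence is meant to avoid.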

Algorithms for binary fields were coded entirely in C except for a one-line assembler
fragment used in polynomial degree calculations in inversion. Assembly coding may
be required in prime fields and OEFs in order to use hardware multipliers producing
a 64-bit product from 32-bit input, and to directly access the carry bit, both of which
are essential to performance in conventional methods. The first of the $\mathbb{F}_{p_{192}}$ columns
in Table 5.3 gives timings for code written primarily in C. For most entries, a significant
penalty is seen relative to the timings with assembly. However, the multiplication
routine uses an in-line assembly fragment for a 32×32 multiply with a three-word
accumulation. If reduction is excluded, the time is very close to that obtained with the
assembly language version, an indication that the Intel compiler handles insertion of
short in-line assembly fragments well.


                                            F_{2^163}   F_{p192} [a]   F_{p192}    F_{(2^31-1)^6}

Addition                                    0.04        0.18           0.07        0.06
Reduction
  Fast reduction                            0.11 [b]    0.25 [c]       0.11 [c]    N/A
  Barrett reduction (Algorithm 2.14)        N/A         1.55 [d]       0.49        N/A
Multiplication (including fast reduction)   1.30 [e]    0.57 [d,f]     0.42 [f]    0.40 [g]
Squaring (including fast reduction)         0.20 [h]    -              0.36 [i]    0.32 [g]
Inversion                                   10.5 [j]    58.3 [k]       25.2 [k]    2.9 [l]
I/M                                         8.1         102.3          60.0        7.3

[a] Coded primarily in C. [b] Algorithm 2.41. [c] Algorithm 2.27. [d] Uses a 32×32 multiply-and-add.
[e] Algorithm 2.36. [f] Algorithm 2.10. [g] Example 2.56. [h] Algorithm 2.39.
[i] Algorithm 2.13. [j] Algorithm 2.48. [k] Algorithm 2.22. [l] Algorithm 2.59.


Table 5.3. Timings (in μs) for field arithmetic on an 800 MHz Intel Pentium III. The binary field
$\mathbb{F}_{2^{163}} = \mathbb{F}_2[z]/(z^{163} + z^7 + z^6 + z^3 + 1)$ and the prime field $\mathbb{F}_{p_{192}}$ for $p_{192} = 2^{192} - 2^{64} - 1$ are
from the NIST recommendations (§A.2). The rightmost column is the optimal extension field
$\mathbb{F}_{p^6} = \mathbb{F}_p[z]/(z^6 - 7)$ for prime $p = 2^{31} - 1$.


Reduction Barrett reduction does not exploit the special form of the NIST prime, and 
the entries can be interpreted as rough cost estimates of reduction with a random 192-bit
prime. In contrast to special primes, this estimate shows that reduction is now a very 
significant part of field multiplication timings, encouraging the use of Montgomery 
(§2.2.4) and other multiplication methods. Significant performance degradation in the 
C version of the fast reduction algorithm is largely explained by the many conditionals 
in the clumsy handling of carry. 


OEF The OEF $\mathbb{F}_{(2^{31}-1)^6}$ in the rightmost column of Table 5.3 is roughly the same size
as $\mathbb{F}_{p_{192}}$. The multiplication is accomplished with an accumulation method (Example
2.56) resembling the method used in $\mathbb{F}_{p_{192}}$, and the resulting times are comparable. As
expected, inversion is significantly faster for the OEF.


NIST fields Tables 5.4 and 5.5 provide timings for the NIST recommended binary 
and prime fields. Note that optimizations in the larger fields were limited to techniques
employed for $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{p_{192}}$. In particular, Karatsuba-Ofman methods were not
competitive in our tests on this platform for the smaller fields, but were not examined 
carefully in the larger fields.


                                     F_{2^163}   F_{2^233}   F_{2^283}   F_{2^409}   F_{2^571}

Addition                             0.04        0.04        0.04        0.06        0.07
Reduction (Algorithms 2.41-2.45)     0.11        0.13        0.19        0.14        0.33
Multiplication (Algorithm 2.36)      1.30        2.27        2.92        5.53        10.23
Squaring (Algorithm 2.39)            0.20        0.23        0.32        0.31        0.56
Inversion (Algorithm 2.48)           10.5        18.6        28.2        53.9        96.4
I/M                                  8.1         8.2         9.7         9.8         9.4


Table 5.4. Timings (in μs) for binary field arithmetic on an 800 MHz Intel Pentium III, including
reduction to canonical form. The fields are from the NIST recommendations (§A.2) with reduction
polynomials $z^{163} + z^7 + z^6 + z^3 + 1$, $z^{233} + z^{74} + 1$, $z^{283} + z^{12} + z^7 + z^5 + 1$, $z^{409} + z^{87} + 1$,
and $z^{571} + z^{10} + z^5 + z^2 + 1$, respectively.








                                     F_{p192}    F_{p224}    F_{p256}    F_{p384}    F_{p521}

Addition                             0.07        0.07        0.08        0.10        0.10
Reduction (Algorithms 2.27-2.31)     0.11        0.12        0.30        0.38        0.20
Multiplication (Algorithm 2.10)      0.42        0.52        0.81        1.47        2.32
Squaring (Algorithm 2.13)            0.36        0.44        0.71        1.23        1.87
Inversion (Algorithm 2.22)           25.2        34.3        44.3        96.3        163.8
I/M                                  60.0        70.0        54.7        65.5        70.6


Table 5.5. Timings (in μs) for prime field arithmetic on an 800 MHz Intel Pentium III, including
reduction to canonical form. The fields are from the NIST recommendations (§A.2)
with $p_{192} = 2^{192} - 2^{64} - 1$, $p_{224} = 2^{224} - 2^{96} + 1$, $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$,
$p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$, and $p_{521} = 2^{521} - 1$.





Multiplication and inversion in binary fields 


In point multiplication on elliptic curves (§3.3), the cost of field inversion relative to 
field multiplication is of particular interest. This section presents estimates of the ratio 
for the NIST binary fields (where the ratio is expected to be relatively small) for two 
platforms. The three inversion methods discussed in §2.3.6 are compared, along with 
timings for 32-bit and 64-bit code. The results also show significant differences among 
the compilers used. 

Table 5.6 gives comparative timings on two popular platforms, the Intel Pentium III
and Sun UltraSPARC IIe. Both processors are capable of 32- and 64-bit operations,
although only the UltraSPARC is 64-bit. The 64-bit operations on the Pentium III are
via the single-instruction multiple-data (SIMD) registers, introduced on the Pentium
MMX (see Table 5.1). The inversion methods are the extended Euclidean algorithm
(EEA) in Algorithm 2.48, the binary Euclidean algorithm (BEA) in Algorithm 2.49,
and the almost inverse algorithm (AIA) in Algorithm 2.50. The example fields are
taken from the NIST recommendations, with reduction polynomials $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$
and $f(z) = z^{233} + z^{74} + 1$. Both allow fast reduction, but only the latter
is favourable to the almost inverse algorithm. Field multiplication based on the comb
method (Algorithm 2.36) appears to be fastest on these platforms. A width-4 comb
was used, and the times include reduction. Other than the MMX code and a one-line
assembler fragment for EEA, algorithms were coded entirely in C.


[Table 5.6 body: timing entries not recoverable. Columns: Pentium III (800 MHz), 32-bit (gcc, icc) and 64-bit (mmx); SPARC (500 MHz), 32-bit and 64-bit (gcc, cc). Rows, for arithmetic in each of $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$: multiplication, Euclidean algorithm, binary Euclidean algorithm, almost inverse, I/M.]


Table 5.6. Multiplication and inversion times for the Intel Pentium III and Sun UltraSPARC IIe.
The compilers are GNU C 2.95 (gcc), Intel 6 (icc), and Sun Workshop 6U2 (cc). The 64-bit
“multimedia” registers were employed for the entries under “mmx.” Inversion to multiplication
(I/M) uses the best inversion time.

Some table entries are as expected, for example, the relatively good times for almost 
inverse in $\mathbb{F}_{2^{233}}$. Other entries illustrate the significant differences between platforms
or compilers on a single platform. Apparent inconsistencies remain in Table 5.6, 
but we believe that the fastest times provide meaningful estimates of inversion and 
multiplication costs on these platforms. 


Division The timings do not make a very strong case for division using a modification 
of the BEA (§2.3.6). For the 32-bit code, unless EEA or AIA can be converted to 
efficiently perform division, then only the entry for $\mathbb{F}_{2^{163}}$ on the SPARC supports use
of BEA-like division. Furthermore, the ratio I/M is at least 8 in most cases, and hence
the savings from use of a division algorithm would be less than 10%. With such a ratio,
elliptic curve methods will be chosen to reduce the number of inversions, so the savings
on a point multiplication kP would be significantly less than 10%.
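
One way to arrive at a bound of this size is the rough estimate below. It rests on assumptions not stated in the paragraph above: that the comparison is made per affine point addition on a binary curve, which with the usual formulas costs one inversion, two multiplications, and one squaring ($I + 2M + S$), and that a division algorithm costs about the same as an inversion ($D \approx I$) while replacing one inversion and one multiplication.

```latex
\[
  \text{savings}
  = \frac{(I + 2M + S) - (D + M + S)}{I + 2M + S}
  \approx \frac{M}{I + 2M + S}
  = \frac{1}{I/M + 2 + S/M}.
\]
% With I/M = 8.1 and S/M = 0.20/1.30 \approx 0.15
% (the F_{2^{163}} column of Table 5.4):
\[
  \frac{1}{8.1 + 2 + 0.15} \approx 0.098 .
\]
```

Under these assumptions the saving per affine addition is just under 10% when I/M is about 8, and it shrinks as I/M grows.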

On the other hand, if affine-only arithmetic is in use in a point multiplication method 
based on double-and-add, then a fast division would be especially welcomed even if 
I/M is significantly larger than 5. If BEA is the algorithm of choice, then division has 
essentially the same cost as inversion. 


Implementation notes General programming considerations for the implementations
used here are covered in §5.1.4. In particular, to obtain acceptable multiplication times
with gcc on the Sun SPARC, code was tuned to be more “gcc-friendly.” Limited tuning
for gcc was also performed on the inversion code. Optimizing the inversion code is tedious,
in part because rough operation counts at this level often fail to capture processor
or compiler characteristics adequately.


Multimedia registers The Intel Pentium family (all but the original and the Pentium 
Pro) and AMD processors possess eight 64-bit “multimedia” registers that were em- 
ployed for the timings in the column marked “mmx.” Use of these capabilities for field 
arithmetic is discussed in §5.1.3. 


EEA Algorithm 2.48 requires polynomial degree calculations. On the SPARC, de- 
gree was found by binary search and table lookup, once the nonzero word of interest 
is located. On the Pentium, a bit scan instruction (bsr) that finds the position of the 
most significant bit in a word was employed via in-line assembly, resulting in an 
improvement of approximately 15% in inversion times. 

The code tracks the lengths of u and v using t fragments of similar code, each fragment
corresponding to the current “top” of u and v. Here, t was chosen to be the number
of words required to represent field elements.


BEA Algorithm 2.49 was implemented with a t-fragment split to track the lengths of 
u and v efficiently. Rather than the degree calculation indicated in step 3.3, a simpler 
comparison on the appropriate words was used. 


AIA Algorithm 2.50 allows efficient tracking of the lengths of $g_1$ and $g_2$ (in addition
to the lengths of u and v). A total of $t^2$ similar fragments of code were used, a significant
amount of code expansion unless t is small. As with BEA, a simple comparison
replaces the degree calculations. Note that only the reduction polynomial for $\mathbb{F}_{2^{233}}$ is
favourable to the almost inverse algorithm.


Prime field multiplication methods 


For prime fields, traditional approaches for field multiplication are often throttled by
limitations of hardware integer multipliers and carry propagation. Both the UltraSPARC
and the Pentium family processors suffer from such limitations. The Intel
Pentium 4 is in fact much slower (in terms of processor cycles) in some operations
than the preceding generation of Pentium processors. As an example, field multiplication
in $\mathbb{F}_{p_{224}}$ using Algorithm 2.10 with code targeted at the Pentium II/III appears in
Table 5.5 (from a Pentium III) and Table 5.7 (from a Pentium 4). Despite a factor 2
clock speed advantage for the Pentium 4, the timing is in fact slower than obtained on
the Pentium III.


Karatsuba-Ofman Methods based on Karatsuba-Ofman do not appear to be competitive
with classical methods on the Pentium II/III for fields of this size. Table 5.7
includes times on the Pentium 4 using a depth-2 approach outlined in Example 2.12.
The classical and the Karatsuba-Ofman implementations would benefit from additional
tuning specifically for the Pentium 4; regardless, both approaches will be inferior to the
methods using special-purpose registers discussed next.


Multiplication in F_{p224}             Time (μs)

Classical integer (Algorithm 2.10)     0.62
Karatsuba-Ofman (Example 2.12)         0.82
SIMD (Example 5.2)                     0.27
Floating-point (P-224 in §5.1.2)       0.20 [a]

[a] Excludes conversion to/from canonical form.


Table 5.7. Multiplication in $\mathbb{F}_{p_{224}}$ for the 224-bit NIST prime $p_{224} = 2^{224} - 2^{96} + 1$ on a 1.7 GHz
Intel Pentium 4. The time for the floating-point version includes (partial) reduction to eight
floating-point values, but not to or from canonical form; other times include reduction.


Floating-point arithmetic A strategy with wide applicability involves floating-point 
hardware commonly found on workstations. The basic idea, discussed in more detail in 
§5.1.2, is to exploit fast floating-point capabilities to perform integer arithmetic using a 
suitable field element representation. In applications such as elliptic curve point multi- 
plication, the expensive conversions between integer and floating-point formats can be 
limited to an insignificant portion of the overall computation, provided that the curve 
operations are written to cooperate with the new field representation. This strategy is 
outlined for the NIST recommended prime field $\mathbb{F}_{p_{224}}$ for $p_{224} = 2^{224} - 2^{96} + 1$ in
§5.1.2. Timings for multiplication using a floating-point approach on the Pentium 4 are 
presented in Table 5.7. Note that the time includes partial reduction to eight floating- 
point values (each of size roughly 28 bits), but excludes the expensive conversion to 
canonical reduced form. 


SIMD Fast multiplication can also be built using the single-instruction multiple-data 
(SIMD) registers on the Pentium 4. The common MMX subset was noted in the previous
section for binary field arithmetic, and SSE2 extensions on the Pentium 4 are
suitable for integer operations on vectors of 64-bit integers. §5.1.3 discusses the spe- 
cial registers in more detail. Compared with the floating-point approach, conversion 
between the field representation used with the SIMD registers and canonical form is 
relatively inexpensive, and insertion of SIMD code into a larger framework is rela- 
tively painless. The time for the SIMD approach in Table 5.7 includes the conversions 
and reduction to canonical form. 


5.2. Hardware implementation 


In some applications, a software implementation of an elliptic curve cryptographic 
scheme at required security levels may not provide the desired performance levels. 
In these cases it may be advantageous to design and fabricate hardware accelerators to
meet the performance requirements. This section gives an introduction to hardware implementation
of elliptic curve systems. The main design issues are discussed in §5.2.1.
Architectures for finite field processors are introduced in §5.2.2. We begin with an 
overview of some basic concepts of hardware design. 


Gate A gate is a small electronic circuit that modifies its inputs and produces a single 
output. The most common gate has two inputs (but may have more). Gates com- 
prise the basic building blocks of modern computing devices. The most common 
gates are NOT (inverting its input), NAND (logical AND of two inputs followed 
by inversion), NOR (logical OR of two inputs followed by inversion), and their 
more costly cousins AND and OR. Gate count typically refers to the equivalent 
numbers of 2-input NAND gates. 


VLSI Very large scale integration (VLSI) refers to the building of circuits with gate 
counts exceeding 10,000. A VLSI circuit starts with a description in VHDL, 
Verilog, or other hardware-description languages that is compiled either into in- 
formation needed to produce the circuit (known as synthesis) or into source code 
to be run on general-purpose machines (known as a simulation). The design of 
VLSI circuits involves a trade-off between circuit-delay caused by the speed of 
signal propagation and power dissipation. Judicious layouts of the physical cir- 
cuit affect both. Other tools available include layout editors to assist with block 
placement and timing-analysis tools to tune the design. These custom designs 
can be costly in terms of time, money and other resources. 


FPGA A field-programmable gate array (FPGA) consists of a number of logic blocks 
each of which typically contains more than a single gate and interconnections 
between them. These can be converted into circuits by judicious application 
of power to close or open specific electrical paths. In essence, the FPGA is 
programmed. The change is reversible, allowing circuits to be created and modified
after manufacture (hence “field-programmable”). An FPGA can be large
with a sea of gates numbering 20,000 or more. FPGAs were originally in- 
troduced as a means of prototyping but are increasingly being used to create 
application-specific circuits that will often outperform binary code running on 
generic processors. Programming is typically done with vendor-specific tools 
similar to those used in creating VLSI circuits. 


Gate Array A gate array consists of a regular array of logic blocks where each 
logic block typically contains more than a single gate and also interconnections 
between these blocks. Circuits are formed by judiciously fusing connections be- 
tween blocks. This process is irreversible. With the advent of FPGAs that provide 
considerably more flexibility, gate array technology seems to be used far less. 


ASIC Application-specific integrated circuit (ASIC) is the terminology used in regard
to VLSI or gate array.


Multiplexor A multiplexor is a multiple-input single-output device with a controller
that selects which input becomes the output. These devices provide the
conditional control of a circuit.


Pipelining Pipelining is a design feature that allows a second computation to begin 
before the current computation is completed. 


Parallel Processing Parallel processing is a technique that permits two or more 
computations to happen simultaneously. 
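
The gate and multiplexor definitions above can be made concrete with a small software model. The sketch below is purely illustrative: signals are the integers 0 and 1, the function names are hypothetical, and the NAND counts in the comments are those of this particular construction rather than of any specific circuit technology.

```c
/* One-bit signals are represented as 0 or 1. */
unsigned nand2(unsigned a, unsigned b) { return !(a && b); }

unsigned not1(unsigned a)             { return nand2(a, a); }              /* 1 NAND  */
unsigned and2(unsigned a, unsigned b) { return not1(nand2(a, b)); }        /* 2 NANDs */
unsigned or2(unsigned a, unsigned b)  { return nand2(not1(a), not1(b)); }  /* 3 NANDs */

/* 2-to-1 multiplexor: the output is in1 when sel = 1 and in0 when
 * sel = 0, built here from 4 NANDs. */
unsigned mux2(unsigned sel, unsigned in0, unsigned in1)
{
    return nand2(nand2(in0, not1(sel)), nand2(in1, sel));
}
```

Expressing each gate through `nand2` is one way to see how a design can be summarized by an equivalent count of 2-input NAND gates.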


5.2.1 Design criteria 


The operation that dominates the execution time of an elliptic curve cryptographic 
protocol is point multiplication. Efficient implementation of point multiplication can 
be separated into three distinct layers: 


1. finite field arithmetic (Chapter 2); 
2. elliptic curve point addition and doubling (§3.2); and 
3. point multiplication technique (§3.3).


Accordingly, there is a hierarchy of operations involved in point multiplication with 
point multiplication techniques near the top and the fundamental finite field arithmetic 
at the base. The hierarchy, depicted in Figure 5.2, has been extended to the proto- 
col level. For example, one could decide to implement ECDSA signature generation 
(§4.4.1) entirely in hardware so that the only input to the device is the message to be 
signed, and the only output is the signature for that message. 


[Figure: layers, from top to bottom: Protocols; Point multiplication; Elliptic curve addition and doubling; Finite field arithmetic.]


Figure 5.2. Hierarchy of operations in elliptic curve cryptographic schemes.


An important element of hardware design is to determine those layers of the hier- 
archy that should be implemented in silicon. Clearly, finite field arithmetic must be 
designed into any hardware implementation. One possibility is to design a hardware
accelerator for finite field arithmetic only, and then use an off-the-shelf microprocessor
to perform the higher-level functions of elliptic curve point arithmetic. It is important to 
note that an efficient finite field multiplier does not necessarily yield an efficient point 
multiplier—all layers of the hierarchy need to be optimized. 


Moving point addition and doubling and then point multiplication to hardware pro- 
vides a more efficient ECC processor at the expense of more complexity. In all cases a 
combination of both efficient algorithms and hardware architectures is required. 


One approach to higher functionality is the processor depicted in Figure 5.3. Along 
with program and data memory, the three main components are an arithmetic logic 
unit (AU), an arithmetic unit controller (AUC), and a main controller (MC). The AU 
performs the basic field operations of addition, squaring, multiplication, and inversion, 
and is controlled by the AUC. The AUC executes the elliptic curve operations of point 
addition and doubling. The MC coordinates and executes the method chosen for point 
multiplication, and interacts with the host system.


Figure 5.3. Elliptic curve processor architecture.


Let’s consider how a higher functionality processor might handle the computation 
of kP for a randomly chosen integer k. The host commands the processor to generate 
kP where the integer k and the (affine) coordinates of P are provided by the host. The 
integer k is loaded into the MC, and the coordinates of P are loaded into the AU. The 
MC instructs the AUC to do its initialization which may include converting the affine 
coordinates of P to projective coordinates needed by the point addition and doubling 
formulae. The MC scans the bits of k and instructs the AUC to perform the appropriate
elliptic curve operations, which in turn instructs the AU to perform the appropriate 
finite field operations. After all bits of k are processed, the MC instructs the AUC to 
convert the result back to affine coordinates. The host reads the coordinates of kP 
from the registers in the AU. Two important consequences of having two controllers 
are the ability to permit parallel processing and pipelining of operations. The MC can 
also use the data storage capability to implement algorithms that use precomputation 
to compute kP more efficiently (see §3.3).
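
The division of labour among the three components can be mimicked in software. The sketch below is a toy model under loud assumptions: the elliptic curve group is replaced by the integers modulo a hypothetical `MODULUS` under addition (so kP is simply k·P mod `MODULUS`), P is assumed to be less than `MODULUS`, and all function names are invented for illustration. Only the control flow is meant to resemble the description above.

```c
#include <stdint.h>

#define MODULUS 1000003u   /* hypothetical toy group order */

/* "AU" stand-in: a single arithmetic operation. */
static uint32_t au_add(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x + y) % MODULUS);
}

/* "AUC" stand-ins: group operations built from AU operations. */
static uint32_t auc_point_add(uint32_t Q, uint32_t P) { return au_add(Q, P); }
static uint32_t auc_point_double(uint32_t Q)          { return au_add(Q, Q); }

/* "MC" stand-in: scans the bits of k from the most significant end
 * and issues a double, and possibly an add, for each bit. */
uint32_t mc_point_multiply(uint32_t k, uint32_t P)
{
    uint32_t Q = 0;                        /* identity element */
    for (int i = 31; i >= 0; i--) {
        Q = auc_point_double(Q);
        if ((k >> i) & 1u)
            Q = auc_point_add(Q, P);
    }
    return Q;
}
```

For instance, `mc_point_multiply(13, 5)` yields 65 in this toy group. The add is issued only for the 1-bits of k, so the sequence of operations depends on k, which is the kind of behaviour that side-channel countermeasures (§5.3) are concerned with.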


Criteria for selecting hardware designs


The following are some of the issues that have to be considered in hardware design. It
should be emphasized that a good design demands a thorough understanding of the 
target platform, operating environment, and performance and security requirements. 


1. Cost is always a significant issue with hardware designers, and is driven by all
   of the criteria that follow.

2. Hardware vs. software. Is there a compelling argument to choose a hardware
   accelerator over a software implementation?

3. Throughput. A device that will be installed into a server will likely need to do
   hundreds or thousands of elliptic curve operations per second whereas devices
   designed for handheld computers will require only a small fraction of this.

4. Complexity. The more levels of the hierarchy that the device implements, the
   more complex the circuitry becomes. This translates into more silicon area on a
   custom VLSI device or a much larger FPGA. It will also result in higher cost.

5. Flexibility. Issues pertinent here include the ability of the device to perform
   computations on curves over binary fields and prime fields.

6. Algorithm agility. Many cryptographic protocols require cryptographic
   algorithms to be negotiated on a per-session basis (e.g., SSL). Reconfigurable
   hardware might be an attractive feature provided that performance is not
   significantly impacted.

7. Power consumption. Depending on the environment where the device will
   operate, power consumption may or may not be a major issue. For example,
   contactless smart cards are very constrained by the amount of power available
   for cryptographic operations whereas a server can afford much higher power
   consumption.

8. Security should always be paramount in any design consideration. If the device
   is designed to perform only point additions and doublings, then it is activated
   during a point multiplication kP by the bits associated with the random value k.
   Without careful design of the overall architecture, bits of k could be leaked by
   side-channel attacks. Countermeasures to attacks based on timing, power
   analysis, and electromagnetic radiation (see §5.3) should be considered based on
   the environment in which the device will operate.

9. Overall system architecture. If the overall system has a microprocessor with
   enough free cycles to handle protocol functionality above finite field arithmetic
   (see Figure 5.2), then, depending on other criteria, this may be good reason to
   design the device for finite field arithmetic only.

10. Implementation platform. A custom VLSI or gate array design or an FPGA
    may be used. FPGAs typically have a high per unit cost versus VLSI and
    gate array devices. Design costs are however significantly higher for VLSI
    implementations.




11. Scalability. If it is desirable that the device can provide various levels of security 
(for example by implementing all the NIST curves in §A.2), then one must design 
the underlying finite field processor to accommodate variable field sizes. 


The relative importance of these design criteria depends heavily on the application. 
For example, cost is less of a concern if the hardware is intended for a high-end server 
than if the hardware is intended for a low-end device such as a light switch. Table 5.8 
lists design criteria priorities for these two extreme situations. 


           High-end device                               Low-end device
High priority   Low priority               High priority           Low priority
-------------   -----------------------    ---------------------   -----------------------
Throughput      Cost                       Cost                    Throughput
Security        Power consumption          Hardware vs. software   Flexibility
Scalability     Complexity                 Complexity              Algorithm agility
                System architecture        Power consumption       Scalability
                Implementation platform                            Security
                Algorithm agility                                  System architecture
                Flexibility                                        Implementation platform
                Hardware vs. software

Table 5.8. Priorities for hardware design criteria.


5.2.2 Field arithmetic processors 


This section describes hardware circuits for performing addition, multiplication,
squaring, and inversion operations in a binary field $\mathbb{F}_{2^m}$. The operations in
$\mathbb{F}_{2^m}$ are typically easier to implement in hardware than their counterparts in
prime fields $\mathbb{F}_p$ because bitwise addition in $\mathbb{F}_{2^m}$ does not have any carry
propagation. Moreover, unlike the case of $\mathbb{F}_{2^m}$, squaring in $\mathbb{F}_p$ is roughly as
costly as a general multiplication. As a consequence of squaring being more expensive
in $\mathbb{F}_p$ than $\mathbb{F}_{2^m}$, inversion using multiplication (as described below for
$\mathbb{F}_{2^m}$) is slower in $\mathbb{F}_p$.


Addition 


Recall from §2.3.1 that addition of elements in a binary field $\mathbb{F}_{2^m}$ is performed
bitwise. There is no carry propagation, and hence addition in $\mathbb{F}_{2^m}$ is considerably
simpler to implement in hardware than addition in prime fields $\mathbb{F}_p$.
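
As a small illustration, assume a field element is stored as an integer whose bit i holds the coefficient of $z^i$ (an encoding chosen here for convenience, not one prescribed by the text). Under that assumption, addition is a single XOR:

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two elements of F_{2^m}: coefficient-wise addition mod 2, i.e. XOR."""
    return a ^ b


# (z^4 + z + 1) + (z^4 + z^3 + z) = z^3 + 1
x = 0b10011
y = 0b11010
s = gf2m_add(x, y)
```

Because every element is its own additive inverse, adding y a second time recovers x, so the same circuit serves for subtraction.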


Multiplication 


We discuss the design of a hardware circuit to multiply elements in a binary field
$\mathbb{F}_{2^m}$. We shall only consider the case where the elements of $\mathbb{F}_{2^m}$ are
represented with respect to a polynomial basis. If $f(z)$ is the reduction polynomial,
then we write

   $f(z) = z^m + r(z)$, where $\deg r \le m-1$.

Moreover, if $r(z) = r_{m-1}z^{m-1} + \cdots + r_2 z^2 + r_1 z + r_0$, then we represent $r(z)$ by the
binary vector

   $r = (r_{m-1}, \ldots, r_2, r_1, r_0)$.


A multiplier is said to be bit-serial if it generates one bit of the product at each clock 
cycle. It is digit-serial if it generates more than one bit of the product at each clock 
cycle. We present bit-serial multipliers for the three cases: 


(i) fixed field size with arbitrary reduction polynomial; 
(ii) fixed field size with fixed reduction polynomial; and 
(iii) variable field size (with arbitrary or fixed reduction polynomials). 
We also describe a digit-serial multiplier for the fourth case: 
(iv) fixed field size with fixed reduction polynomial. 


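In Figures 5.4 to 5.11, four symbols are used to denote the following operations on
bits A, B, C (the symbol drawings themselves are not reproduced here):

   $C \leftarrow A$
   $C \leftarrow A \oplus C$
   $C \leftarrow C \oplus (A \,\&\, B)$
   $C \leftarrow A \oplus B$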


(i) Fixed field size with arbitrary reduction polynomial. Algorithm 5.3, which
multiplies a multiplicand $a \in \mathbb{F}_{2^m}$ and a multiplier $b \in \mathbb{F}_{2^m}$, processes the
bits of b from left (most significant) to right (least significant). The multiplier, called
a most significant bit first (MSB) multiplier, is depicted in Figure 5.4 for the case
m = 5. In Figure 5.4, b is a shift register and c is a shift register whose low-end bit is
tied to 0. An MSB multiplier can perform a multiplication in $\mathbb{F}_{2^m}$ in m clock cycles.


Algorithm 5.3 Most significant bit first (MSB) multiplier for $\mathbb{F}_{2^m}$

INPUT: $a = (a_{m-1}, \ldots, a_1, a_0)$, $b = (b_{m-1}, \ldots, b_1, b_0) \in \mathbb{F}_{2^m}$, and reduction
       polynomial $f(z) = z^m + r(z)$.
OUTPUT: $c = a \cdot b$.
1. Set $c \leftarrow 0$.
2. For $i$ from $m-1$ downto 0 do
   2.1 $c \leftarrow \mathrm{leftshift}(c) + c_{m-1} r$.
   2.2 $c \leftarrow c + b_i a$.
3. Return($c$).

Figure 5.4. Most significant bit first (MSB) multiplier for $\mathbb{F}_{2^5}$.


Algorithm 5.4, which multiplies a multiplicand $a \in \mathbb{F}_{2^m}$ and a multiplier
$b \in \mathbb{F}_{2^m}$, processes the bits of b from right (least significant) to left (most
significant). The multiplier, called a least significant bit first (LSB) multiplier, is
depicted in Figure 5.5.


Algorithm 5.4 Least significant bit first (LSB) multiplier for $\mathbb{F}_{2^m}$

INPUT: $a = (a_{m-1}, \ldots, a_1, a_0)$, $b = (b_{m-1}, \ldots, b_1, b_0) \in \mathbb{F}_{2^m}$, and reduction
       polynomial $f(z) = z^m + r(z)$.
OUTPUT: $c = a \cdot b$.
1. Set $c \leftarrow 0$.
2. For $i$ from 0 to $m-1$ do
   2.1 $c \leftarrow c + b_i a$.
   2.2 $a \leftarrow \mathrm{leftshift}(a) + a_{m-1} r$.
3. Return($c$).


One difference between the MSB and LSB multipliers is that the contents of two of 
the four registers in Figure 5.4 are not altered during a multiplication, while three of 
the four registers in Figure 5.5 are altered. In other words, the MSB multiplier only has 
to clock two registers per clock cycle, as compared to three for the LSB multiplier. 
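
Algorithms 5.3 and 5.4 can be sketched in software as follows. This is only a bit-level model of the register updates, not a description of the circuits in Figures 5.4 and 5.5. It assumes elements are stored as integers with bit i holding the coefficient of $z^i$, and that `r` encodes $r(z)$ in the same way; the function names are illustrative.

```python
def msb_multiply(a: int, b: int, r: int, m: int) -> int:
    """Algorithm 5.3: scan b from its most significant bit down."""
    mask = (1 << m) - 1
    c = 0
    for i in range(m - 1, -1, -1):
        carry = (c >> (m - 1)) & 1       # c_{m-1}, about to be shifted out
        c = (c << 1) & mask              # leftshift(c); low-end bit becomes 0
        if carry:
            c ^= r                       # z^m is replaced by r(z)
        if (b >> i) & 1:
            c ^= a                       # c <- c + b_i * a
    return c


def lsb_multiply(a: int, b: int, r: int, m: int) -> int:
    """Algorithm 5.4: scan b from its least significant bit up."""
    mask = (1 << m) - 1
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a                       # c <- c + b_i * a
        carry = (a >> (m - 1)) & 1       # a_{m-1}
        a = (a << 1) & mask              # a <- leftshift(a) + a_{m-1} * r
        if carry:
            a ^= r
    return c


# F_{2^5} with f(z) = z^5 + z^2 + 1, so r(z) = z^2 + 1.
M, R = 5, 0b00101
x, y = 0b10011, 0b01001
product_msb = msb_multiply(x, y, R, M)
product_lsb = lsb_multiply(x, y, R, M)
```

The model makes the register behaviour visible: in `msb_multiply` only `c` is rewritten inside the loop (b is merely scanned), whereas `lsb_multiply` rewrites both `c` and `a`.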


(ii) Fixed field size with fixed reduction polynomial. If the reduction polynomial
$f(z)$ is fixed and is selected to be a trinomial or pentanomial, then the design of the
multiplier is significantly less complex since a register to hold the reduction
polynomial is no longer needed. Figure 5.6 illustrates an MSB multiplier for $\mathbb{F}_{2^5}$ with
fixed reduction polynomial $f(z) = z^5 + z^2 + 1$.

Figure 5.5. Least significant bit first (LSB) multiplier for $\mathbb{F}_{2^5}$.

Figure 5.6. MSB multiplier with fixed reduction polynomial $f(z) = z^5 + z^2 + 1$.


(iii) Variable field size. The MSB multiplier in Figure 5.4 can be extended to multiply
elements in the fields $\mathbb{F}_{2^m}$ for $m \in \{m_1, m_2, \ldots, m_t\}$, where
$m_1 < m_2 < \cdots < m_t$. Each register has length $m_t$. Figure 5.7 illustrates an MSB
multiplier that can implement multiplication in any field $\mathbb{F}_{2^m}$ for
$m \in \{1, 2, \ldots, 10\}$, and for any reduction polynomial. Note that only the contents of
registers b and c change at each clock cycle. The controller loads the bits of a, b and r
from high-order to low-order and sets the unused bits to 0. Although the unused cells
are clocked, they consume little power since their contents do not change.


The circuit can be simplified if each field has a fixed reduction polynomial, preferably
a trinomial or a pentanomial. Figure 5.8 illustrates a variable field size MSB
multiplier for $\mathbb{F}_{2^5}$, $\mathbb{F}_{2^7}$, and $\mathbb{F}_{2^{10}}$ with the fixed reduction polynomials
$z^5 + z^2 + 1$, $z^7 + z + 1$, and $z^{10} + z^3 + 1$, respectively. A multiplexor is used to select
the desired field. Loading registers and controlling the multiplexor is the function of
the controller.

Figure 5.7. MSB multiplier for fields $\mathbb{F}_{2^m}$ with $1 \le m \le 10$. A multiplier for $\mathbb{F}_{2^6}$ is shown.


(iv) Digit-serial multiplier for fixed field size with fixed reduction polynomial. We
consider multiplication of two elements a and b in $\mathbb{F}_{2^m}$ where the multiplier b is
expressed as a polynomial having $l = \lceil m/k \rceil$ digits

   $b = \sum_{i=0}^{l-1} B_i z^{ki}$,

where each digit $B_i$ is a binary polynomial of degree at most $k-1$. One way to express
the product $a \cdot b$ is the following:

   $a \cdot b = \Bigl( a \sum_{i=0}^{l-1} B_i z^{ki} \Bigr) \bmod f(z)$
   $\phantom{a \cdot b} = \Bigl( \sum_{i=0}^{l-1} B_i \bigl( a z^{ki} \bmod f(z) \bigr) \Bigr) \bmod f(z)$,

where $f(z)$ is the reduction polynomial for $\mathbb{F}_{2^m}$. Algorithm 5.5 is a digit-serial
multiplier derived from this observation.

Figure 5.8. MSB multiplier for fields $\mathbb{F}_{2^5}$, $\mathbb{F}_{2^7}$, and $\mathbb{F}_{2^{10}}$ with reduction
polynomials $z^5 + z^2 + 1$, $z^7 + z + 1$, and $z^{10} + z^3 + 1$. The multiplier for $\mathbb{F}_{2^5}$ is shown.


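Algorithm 5.5 Digit-serial multiplier for $\mathbb{F}_{2^m}$

INPUT: $a = \sum_{i=0}^{m-1} a_i z^i \in \mathbb{F}_{2^m}$, $b = \sum_{i=0}^{l-1} B_i z^{ki} \in \mathbb{F}_{2^m}$, reduction
       polynomial $f(z)$.
OUTPUT: $c = a \cdot b$.
1. Set $c \leftarrow 0$.
2. For $i$ from 0 to $l-1$ do
   2.1 $c \leftarrow c + B_i a$.
   2.2 $a \leftarrow a \cdot z^k \bmod f(z)$.
3. Return($c \bmod f(z)$).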


A hardware circuit for executing Algorithm 5.5 consists of a shift register to hold
the multiplicand a, another shift register to hold the multiplier b, and an accumulating
register (not a shift register) to hold c. The registers holding a and b are each m bits
in length, whereas c is $(m + k - 1)$ bits long. At the ith iteration, the content of a is
$a z^{ki} \bmod f(z)$. The product $B_i \cdot (a z^{ki} \bmod f(z))$ is called a digit multiplication. The
result of this digit multiplication is at most $m + k - 1$ bits in length and is XORed into
the accumulator c. If the circuit can compute $a z^{k(i+1)} \bmod f(z)$ and
$B_i \cdot (a z^{ki} \bmod f(z))$ in a single clock, then the entire multiplication can be completed
in $l$ clock cycles. While the complexity of the circuit increases with k, a k-fold speedup
for multiplication can be achieved.
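
A software model of Algorithm 5.5 may help make the digit decomposition concrete. The sketch below assumes the integer encoding of polynomials (bit i is the coefficient of $z^i$) and takes the full reduction polynomial `f` as an integer; the helper names `clmul` and `poly_mod` are illustrative and do not come from the text.

```python
def clmul(x: int, y: int) -> int:
    """Carry-less (binary polynomial) product of x and y, without reduction."""
    p = 0
    while y:
        if y & 1:
            p ^= x
        x <<= 1
        y >>= 1
    return p


def poly_mod(p: int, f: int) -> int:
    """Remainder of the binary polynomial p modulo f."""
    deg_f = f.bit_length() - 1
    while p.bit_length() > deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def digit_serial_multiply(a: int, b: int, f: int, m: int, k: int) -> int:
    """Algorithm 5.5: consume b in l = ceil(m/k) digits of k bits each."""
    l = -(-m // k)
    digit_mask = (1 << k) - 1
    c = 0
    for i in range(l):
        B_i = (b >> (k * i)) & digit_mask   # digit B_i, degree at most k-1
        c ^= clmul(a, B_i)                  # digit multiplication, at most m+k-1 bits
        a = poly_mod(a << k, f)             # a <- a * z^k mod f(z)
    return poly_mod(c, f)                   # final reduction of the accumulator


# F_{2^5} with f(z) = z^5 + z^2 + 1 and k = 2, so l = 3 digits.
F5 = 0b100101
product = digit_serial_multiply(0b10011, 0b01001, F5, m=5, k=2)
```

The loop body runs l times rather than m times, which is the source of the k-fold reduction in clock cycles described above; the cost moves into the wider digit multiplication performed in each iteration.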

Figure 5.9 shows the a register for a 2-digit multiplier for $\mathbb{F}_{2^5}$ where the field is
defined by the reduction polynomial $f(z) = z^5 + z^2 + 1$. In this example, we have
$k = 2$ and $l = 3$. Figure 5.10 shows the circuit for digit multiplication excluding the
interconnect of Figure 5.9 and the interconnect on the c register for the final reduction
modulo $f(z)$. The final reduction interconnect will require multiplexors.

Figure 5.9. Circuit to compute $a z^{ki} \bmod f(z)$, where $f(z) = z^5 + z^2 + 1$ and $k = 2$.

Figure 5.10. A 2-digit multiplier for $\mathbb{F}_{2^5}$ defined by $f(z) = z^5 + z^2 + 1$.


Squaring 


Squaring can of course be performed using any of the multipliers described above. If 
the reduction polynomial f(z) is fixed and is a trinomial or a pentanomial, then it is 
possible to design a circuit that will perform a squaring operation in a single clock 
cycle (vs. m clock cycles for the bit-serial multipliers). Moreover, the squaring circuit 
will add very little complexity to the multiplication circuit. A squaring circuit that takes 
only one clock cycle is important when inversion is done by multiplication (see below).

For example, consider the field $\mathbb{F}_{2^7}$ with reduction polynomial $f(z) = z^7 + z + 1$. If
$a = a_6 z^6 + a_5 z^5 + a_4 z^4 + a_3 z^3 + a_2 z^2 + a_1 z + a_0$, then

   $c = a^2$
   $\phantom{c} = a_6 z^{12} + a_5 z^{10} + a_4 z^8 + a_3 z^6 + a_2 z^4 + a_1 z^2 + a_0$
   $\phantom{c} = (a_6 + a_3) z^6 + a_6 z^5 + (a_5 + a_2) z^4 + a_5 z^3 + (a_4 + a_1) z^2 + a_4 z + a_0$.

A squaring circuit is illustrated in Figure 5.11.
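
The closed form above can be checked mechanically. The following sketch hard-wires the seven output bits exactly as in the last line of the derivation and compares them with a generic "spread the bits, then reduce" squaring; both function names are illustrative, and the integer encoding (bit i is the coefficient of $z^i$) is an assumption of the sketch.

```python
def square_f2_7(a: int) -> int:
    """Square in F_{2^7} with f(z) = z^7 + z + 1, using the closed form."""
    a0, a1, a2, a3, a4, a5, a6 = ((a >> i) & 1 for i in range(7))
    c = [
        a0,         # coefficient of z^0
        a4,         # z^1
        a4 ^ a1,    # z^2
        a5,         # z^3
        a5 ^ a2,    # z^4
        a6,         # z^5
        a6 ^ a3,    # z^6
    ]
    return sum(bit << i for i, bit in enumerate(c))


def square_generic(a: int, f: int = 0b10000011, m: int = 7) -> int:
    """Square by moving bit i to position 2i, then reducing modulo f."""
    s = 0
    for i in range(m):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    for d in range(2 * m - 2, m - 1, -1):   # clear degrees 2m-2 down to m
        if (s >> d) & 1:
            s ^= f << (d - m)
    return s


agree = all(square_f2_7(a) == square_generic(a) for a in range(128))
```

Each output bit depends on at most two input bits, which is why the operation fits in a single clock cycle with only a few XOR gates.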
Figure 5.11. Squaring circuit for $\mathbb{F}_{2^7}$ with fixed reduction polynomial $f(z) = z^7 + z + 1$.


Inversion 


The most difficult finite field operation to implement in hardware is inversion. There 
are two basic types of inversion algorithms: those based on the extended Euclidean al- 
gorithm and its variants (cf. §2.3.6), and those that use field multiplication. Inversion 
by multiplication does not add significantly to the complexity of a hardware design, but 
can severely impact performance if it is needed frequently. This is the reason why most 
hardware (and for that matter software) designers prefer projective coordinates over 
affine. Additional functionality must be incorporated into the controller but extensive 
modifications to the core circuit are not required. If affine coordinates are preferred, 
then inversion will undoubtedly be the bottleneck in performance thereby necessitating 
an inversion circuit based on the extended Euclidean algorithm. Such a circuit will add 
more complexity to both the core circuit and the controller. It seems that the added com- 
plexity does not justify implementing inversion by the extended Euclidean algorithm, 
and therefore we restrict our attention to inversion methods that use multiplication. 
Let a be a nonzero element in $\mathbb{F}_{2^m}$. Inversion by multiplication uses the fact that

   $a^{-1} = a^{2^m - 2}$.   (5.2)

Since $2^m - 2 = \sum_{i=1}^{m-1} 2^i$, we have

   $a^{-1} = \prod_{i=1}^{m-1} a^{2^i}$.   (5.3)

Thus, $a^{-1}$ can be computed by $m-1$ squarings and $m-2$ multiplications. We next
show how the number of multiplications can be reduced. First observe that


   $a^{-1} = a^{2^m - 2} = \bigl( a^{2^{m-1} - 1} \bigr)^2$.

Hence $a^{-1}$ can be computed in one squaring once $a^{2^{m-1} - 1}$ has been evaluated. Now if
m is odd then

   $2^{m-1} - 1 = \bigl( 2^{(m-1)/2} - 1 \bigr) \bigl( 2^{(m-1)/2} + 1 \bigr)$.   (5.4)

If we let

   $b = a^{2^{(m-1)/2} - 1}$,

then by (5.4) we have

   $a^{2^{m-1} - 1} = b \cdot b^{2^{(m-1)/2}}$.

Hence $a^{2^{m-1} - 1}$ can be computed with one multiplication and $(m-1)/2$ squarings once
b has been evaluated. Similarly, if m is even then

   $2^{m-1} - 1 = 2 \bigl( 2^{m-2} - 1 \bigr) + 1 = 2 \bigl( 2^{(m-2)/2} - 1 \bigr) \bigl( 2^{(m-2)/2} + 1 \bigr) + 1$.   (5.5)

If we let

   $c = a^{2^{(m-2)/2} - 1}$,

then by (5.5) we have

   $a^{2^{m-1} - 1} = a \cdot \bigl( c \cdot c^{2^{(m-2)/2}} \bigr)^2$.

Hence $a^{2^{m-1} - 1}$ can be computed with two multiplications and $m/2$ squarings once c
has been evaluated. This procedure can be repeated recursively to eventually compute
$a^{-1}$. The total number of multiplications in this procedure can be shown to be

   $\lfloor \log_2(m-1) \rfloor + w(m-1) - 1$,   (5.6)

where $w(m-1)$ denotes the number of 1s in the binary representation of $m-1$,
while the total number of squarings is $m-1$. This inversion procedure is shown in
Algorithm 5.6 when m is odd.


Algorithm 5.6 Inversion in $\mathbb{F}_{2^m}$ (m odd)

INPUT: Nonzero element $a \in \mathbb{F}_{2^m}$.
OUTPUT: $a^{-1}$.
1. Set $A \leftarrow a^2$, $B \leftarrow 1$, $x \leftarrow (m-1)/2$.
2. While $x \ne 0$ do
   2.1 $A \leftarrow A \cdot A^{2^x}$.
   2.2 If $x$ is even then $x \leftarrow x/2$;
       Else $B \leftarrow B \cdot A$, $A \leftarrow A^2$, $x \leftarrow (x-1)/2$.
3. Return($B$).

Table 5.9 shows the number of squarings and multiplications needed to compute
inverses in the NIST binary fields $\mathbb{F}_{2^{163}}$, $\mathbb{F}_{2^{233}}$, $\mathbb{F}_{2^{283}}$, $\mathbb{F}_{2^{409}}$ and $\mathbb{F}_{2^{571}}$
using Algorithm 5.6. The last squaring of A in step 2.2 is not required, and therefore is
not included in the operation counts.





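   m     $\lfloor \log_2(m-1) \rfloor$     $w(m-1)$     multiplications     squarings
  ---    ---------------------------     --------     ---------------     ---------
  163                 7                      3                9               162
  233                 7                      4               10               232
  283                 8                      4               11               282
  409                 8                      4               11               408
  571                 9                      5               13               570

Table 5.9. Operation counts for inversion in the binary fields $\mathbb{F}_{2^{163}}$, $\mathbb{F}_{2^{233}}$, $\mathbb{F}_{2^{283}}$,
$\mathbb{F}_{2^{409}}$ and $\mathbb{F}_{2^{571}}$ using Algorithm 5.6.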
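
Algorithm 5.6 and the counts in Table 5.9 can be reproduced with a short software model. The sketch below is illustrative only: it stores field elements as integers, counts a multiplication only when neither operand is the constant 1, and skips the final unused squaring, which are the conventions the table appears to follow. The NIST reduction polynomials in `NIST_F` are not given in this section; they are taken here from FIPS 186 and should be checked against §A.1 before being relied on.

```python
def gf_mul(x: int, y: int, f: int) -> int:
    """Multiply binary polynomials x and y, then reduce modulo f."""
    p = 0
    while y:
        if y & 1:
            p ^= x
        x <<= 1
        y >>= 1
    deg_f = f.bit_length() - 1
    while p.bit_length() > deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def invert(a: int, f: int, m: int):
    """Algorithm 5.6 for odd m; returns (inverse, multiplications, squarings)."""
    if m % 2 == 0 or a == 0:
        raise ValueError("need odd m and nonzero a")
    muls = sqrs = 0

    def square_times(v, n):
        nonlocal sqrs
        for _ in range(n):
            v = gf_mul(v, v, f)
        sqrs += n
        return v

    A = square_times(a, 1)                       # A <- a^2
    B, b_is_one = 1, True
    x = (m - 1) // 2
    while x != 0:
        A = gf_mul(A, square_times(A, x), f)     # A <- A * A^(2^x)
        muls += 1
        if x % 2 == 0:
            x //= 2
        else:
            if b_is_one:                         # B = 1, so B * A costs nothing
                B, b_is_one = A, False
            else:
                B = gf_mul(B, A, f)
                muls += 1
            x = (x - 1) // 2
            if x != 0:                           # the last squaring of A is never used
                A = square_times(A, 1)
    return B, muls, sqrs


def poly(*exponents: int) -> int:
    """Binary polynomial with the given exponents set."""
    return sum(1 << e for e in exponents)


F7 = poly(7, 1, 0)                               # z^7 + z + 1, as in the squaring example
inv, muls, sqrs = invert(0b1010011, F7, 7)

NIST_F = {                                       # assumed from FIPS 186, not from this section
    163: poly(163, 7, 6, 3, 0),
    233: poly(233, 74, 0),
    283: poly(283, 12, 7, 5, 0),
    409: poly(409, 87, 0),
    571: poly(571, 10, 5, 2, 0),
}
counts = {m: invert(0b11, f, m)[1:] for m, f in NIST_F.items()}
```

The operation counts depend only on the binary expansion of $(m-1)/2$, not on the element being inverted or on the reduction polynomial, so `counts` can be compared directly with the last two columns of Table 5.9.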


5.3. Secure implementation 


When assessing the security of a cryptographic protocol, one usually assumes that the 
adversary has a complete description of the protocol, is in possession of all public 
keys, and is only lacking knowledge of the secret keys. In addition, the adversary may 
have intercepted some data exchanged between the legitimate participants, and may 
even have some control over the nature of this data (e.g., by selecting the messages 
in a chosen-message attack on a signature scheme, or by selecting the ciphertext in 
a chosen-ciphertext attack on a public-key encryption scheme). The adversary then 
attempts to compromise the protocol goals by either solving an underlying problem 
assumed to be intractable, or by exploiting some design flaw in the protocol. 

The attacks considered in this traditional security model exploit the mathematical 
specification of the protocol. In recent years, researchers have become increasingly 
aware of the possibility of attacks that exploit specific properties of the implementation 
and operating environment. Such side-channel attacks utilize information leaked dur- 
ing the protocol’s execution and are not considered in traditional security models. For 
example, the adversary may be able to monitor the power consumed or the electromag- 
netic radiation emitted by a smart card while it performs private-key operations such 
as decryption and signature generation. The adversary may also be able to measure 
the time it takes to perform a cryptographic operation, or analyze how a cryptographic 
device behaves when certain errors are encountered. Side-channel information may be 
easy to gather in practice, and therefore it is essential that the threat of side-channel 
attacks be quantified when assessing the overall security of a system. 

It should be emphasized that a particular side-channel attack may not be a realistic 
threat in some environments. For example, attacks that measure power consumption of 
a cryptographic device can be considered very plausible if the device is a smart card 
that draws power from an external, untrusted source. On the other hand, if the device
is a workstation located in a secure office, then power consumption attacks are not a
significant threat. 

The objective of this section is to provide an introduction to side-channel attacks 
and their countermeasures. We consider power analysis attacks, electromagnetic anal- 
ysis attacks, error message analysis, fault analysis, and timing attacks in §5.3.1, §5.3.2, 
§5.3.3, §5.3.4, and §5.3.5, respectively. The countermeasures that have been proposed 
are algorithmic, software-based, hardware-based, or combinations thereof. None of 
these countermeasures are guaranteed to defeat all side-channel attacks. Furthermore, 
they may slow cryptographic computations and have expensive memory or hardware 
requirements. The efficient and secure implementation of cryptographic protocols on 
devices such as smart cards is an ongoing and challenging research problem that 
demands the attention of both cryptographers and engineers. 


5.3.1 Power analysis attacks 


CMOS (Complementary Metal-Oxide Semiconductor) logic is the dominant semiconductor
technology for microprocessors, memories, and application specific integrated
circuits (ASICs). The basic building unit in CMOS logic is the inverter, or NOT gate,
depicted in Figure 5.12. It consists of two transistors, one P-type and one N-type, that
serve as voltage-controlled switches. A high voltage signal is interpreted as a logical
‘1’, while a low voltage signal is interpreted as a logical ‘0’. If the input voltage
$V_{\mathrm{in}}$ is low, then the P-type transistor is conducting (i.e., the switch is closed) while
the N-type transistor is non-conducting; in this case, there is a path from the supply
voltage to the output and therefore $V_{\mathrm{out}}$ is high. Conversely, if $V_{\mathrm{in}}$ is high, then the
P-type transistor is non-conducting while the N-type transistor is conducting; in this
case, there is a path from the output to the ground and therefore $V_{\mathrm{out}}$ is low. When the
inverter switches state, there is a short period of time during which both transistors
conduct current. This causes a short circuit from the power supply to the ground. There
is also current flow when internal capacitive loads attached to the inverter’s output are
charged or discharged.

Figure 5.12. CMOS logic inverter.

During a clock cycle, current flows through only a small proportion of the gates in a 
CMOS device—those gates that are active during the execution of a particular instruc- 
tion. Thus, the power consumed by the device can be expected to change continuously 
as the device executes a complicated series of instructions. 

If the power to the device is supplied at a constant voltage, then the power consumed 
by the device is proportional to the flow of current. The current flow, and thus also the 
power consumption, can be measured by placing a resistor in series with the power 
supply and using an oscilloscope to measure the voltage difference across the resistor. 
One can then plot a power trace, which shows the power consumed by the device 
during each clock cycle. 
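
The conversion from oscilloscope readings to a power trace is simple arithmetic: the current is $I = V_R / R$ and the power is approximately $P = V_{\mathrm{supply}} \cdot I$. A minimal sketch follows; the resistor value, supply voltage, and sample readings are hypothetical numbers chosen for illustration, and the voltage drop across the resistor is assumed small enough to ignore.

```python
def power_trace(shunt_volts, r_shunt_ohms, supply_volts):
    """Per-sample power estimate: I = V_R / R, then P = V_supply * I."""
    return [supply_volts * (v / r_shunt_ohms) for v in shunt_volts]


samples = [0.020, 0.050, 0.030]      # hypothetical voltages across the resistor, in volts
trace_watts = power_trace(samples, r_shunt_ohms=10.0, supply_volts=5.0)
```

Since the supply voltage and resistance are constants, the shape of the power trace is the shape of the measured voltage; an attacker usually works with the raw samples directly.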

The hypothesis behind power analysis attacks is that the power traces are correlated 
to the instructions the device is executing as well as the values of the operands it is 
manipulating. Therefore, examination of the power traces can reveal information about 
the instructions being executed and contents of data registers. In the case that the device 
is executing a secret-key cryptographic operation, it may then be possible to deduce the 
secret key. 


Simple power analysis 


In simple power analysis (SPA) attacks, information about secret keying material is 
deduced directly by examining the power trace from a single secret key operation. 
Implementations of elliptic curve point multiplication algorithms are particularly vul- 
nerable because the usual formulas for adding and doubling points are quite different 
and therefore may have power traces which can readily be distinguished. Figure 5.13 
shows the power trace for a sequence of addition (S) and double (D) operations on an 
elliptic curve over a prime field. Points were represented using Jacobian coordinates 
(see §3.2.1) whereby an addition operation takes significantly longer than a double 
operation.

Figure 5.13. Power trace for a sequence of addition (S) and double (D) operations on an elliptic
curve over a prime field. Points were represented using Jacobian coordinates. The traces were
obtained from an SC140 DSP processor core.

Consider, for example, a device that performs a point multiplication kP during
ECDSA signature generation (Algorithm 4.29). Here, P is a publicly-known elliptic 
curve point and k is a secret integer. Recall that knowledge of a single per-message 
secret k and the corresponding message and signature allows one to easily recover the 
long-term private key (cf. Note 4.34). Suppose first that one of the binary methods for 
point multiplication (Algorithms 3.26 and 3.27) is used. If examination of a power trace 
of a point multiplication reveals the sequence of double and addition operations, then 
one immediately learns the individual bits of k. Suppose now that a more sophisticated 
point multiplication method is employed; for concreteness consider the binary NAF 
method (Algorithm 3.31). If the power trace reveals the sequence of double and addi- 
tion operations, then an adversary learns the digits of NAF(k) that are 0, which yields 
substantial information about k. 
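
The first case, a binary method whose operation sequence is fully visible, can be sketched as follows. The model is idealized: it assumes the left-to-right binary method performs one double (D) per bit of k followed by one addition (S) when the bit is 1, and that the attacker reads this sequence from the trace without error. The function names are illustrative.

```python
def op_sequence(k: int) -> str:
    """Operations of the left-to-right binary method for a scalar k >= 1."""
    ops = []
    for bit in bin(k)[2:]:
        ops.append("D")              # every bit costs a double
        if bit == "1":
            ops.append("S")          # a 1 bit also costs an addition
    return "".join(ops)


def recover_scalar(ops: str) -> int:
    """Read the bits back: 'DS' is a 1 bit, a lone 'D' is a 0 bit."""
    bits = ops.replace("DS", "1").replace("D", "0")
    return int(bits, 2)


k = 0b1011001
observed = op_sequence(k)
recovered = recover_scalar(observed)
```

For the binary NAF method the same reading reveals only where the nonzero digits are, since an addition and a subtraction look alike in this model; that is why the text says the attacker learns the zero digits rather than k itself.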


Knowledge of how the algorithm is used and implemented facilitates SPA attacks.
Any implementation where the execution path is determined by the key bits has a 
potential vulnerability. 


Countermeasures Numerous techniques for resisting SPA attacks have been pro- 
posed. These countermeasures involve modifications to the algorithms, software 
implementations, hardware implementations, or combinations thereof. The effective- 
ness of the countermeasures is heavily dependent on the characteristics of the hardware 
platform, the operating environment, and the capabilities of the adversary, and must be 
evaluated on a case-by-case basis. As an example, Figure 5.14 shows the power
trace for a sequence of addition (S) and double (D) operations on an elliptic curve
over a prime field. Dummy operations were inserted in the algorithms for addition
and doubling in such a way that the sequence of elementary operations involved in
a doubling operation is repeated exactly twice in an addition operation. Compared to
Figure 5.13, it seems impossible to distinguish the addition and double operations by
casual inspection of the power trace in Figure 5.14.

Figure 5.14. Power trace for a sequence of addition (S) and double (D) operations on an elliptic
curve over a prime field. Points were represented using Jacobian coordinates. SPA resistance was
achieved by insertion of dummy operations in the addition and double algorithms (compare with
Figure 5.13). The traces were obtained from an SC140 DSP processor core.

None of the countermeasures that have been proposed are guaranteed to provide
adequate protection. It is also important to note that resistance to SPA attacks does not 
guarantee resistance to other side-channel attacks such as differential power analysis 
and electromagnetic analysis attacks. It is therefore impossible at present to provide 
general recommendations for the best countermeasures to SPA attacks. Instead we just 
give one example and list other methods in the Notes section starting on page 254. 

Algorithm 5.7 is a modification of the left-to-right binary point multiplication 
method to provide enhanced resistance to SPA attacks. Dummy operations are included 
in the main loop so that the same basic elliptic curve operations (one double and one 
addition) are performed in each iteration. Thus the sequence of doubles and additions
deduced from the power trace does not reveal any information about the bits of k. As 
with most algorithmic countermeasures, the increased security comes at the expense of 
slower performance. 


Algorithm 5.7 SPA-resistant left-to-right binary point multiplication

INPUT: $k = (k_{t-1}, \ldots, k_1, k_0)_2$, $P \in E(\mathbb{F}_q)$.
OUTPUT: $kP$.
1. $Q_0 \leftarrow \infty$.
2. For $i$ from $t-1$ downto 0 do
   2.1 $Q_0 \leftarrow 2Q_0$.
   2.2 $Q_1 \leftarrow Q_0 + P$.
   2.3 $Q_0 \leftarrow Q_{k_i}$.
3. Return($Q_0$).
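
A direct transcription of Algorithm 5.7 is sketched below on a toy curve. Everything about the setting is an assumption made for illustration: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the point (5, 1), affine arithmetic, and the use of `None` for the point at infinity. It requires Python 3.8 or later for `pow(x, -1, p)`.

```python
P_MOD, A_COEF = 17, 2                # toy curve y^2 = x^3 + 2x + 2 over F_17


def ec_add(P, Q):
    """Affine point addition; None stands for the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                  # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def spa_resistant_multiply(k: int, t: int, P):
    """Algorithm 5.7: one double and one addition in every iteration."""
    Q = [None, None]                      # Q[0] = Q_0 = infinity
    for i in range(t - 1, -1, -1):
        Q[0] = ec_add(Q[0], Q[0])         # step 2.1: always double
        Q[1] = ec_add(Q[0], P)            # step 2.2: always add (a dummy when k_i = 0)
        Q[0] = Q[(k >> i) & 1]            # step 2.3: keep the sum only if k_i = 1
    return Q[0]


G = (5, 1)
result = spa_resistant_multiply(13, t=4, P=G)
```

This models only the sequence of curve operations. In a real device the key-dependent selection in step 2.3 and the special-case branches inside `ec_add` could themselves leak, so the sketch should not be read as a side-channel-safe implementation.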


Differential power analysis 


Differential power analysis (DPA) attacks exploit variations in power consumption that 
are correlated to the data values being manipulated. These variations are typically much 
smaller than those associated with different instruction sequences, and may be obfus- 
cated by noise and measurement errors. Statistical methods are used on a collection of 
power traces in order to reduce the noise and strengthen the differential signals.

To launch a DPA attack, an adversary first selects an internal variable V that is en- 
countered during the execution of the cryptographic operation and has the property that 
knowledge of the input message m and a portion k’ of the unknown secret key deter- 
mines the value of V. The determining function V = f(k’,m) is called the selection 
function. Let us assume for simplicity that V is a single bit. The adversary collects a 
number of power traces (e.g., a few thousand) from the device that performs the cryp- 
tographic operation. She then makes a guess for k’, and partitions the power traces into 
two groups according to the predicted value of the bit V. The power traces in each 
group are averaged, and the difference of the averages, called the differential trace, is 
plotted. The idea is that the value of V will have some (possibly very small) influence
on the power trace. Thus, if the guess for k’ is incorrect, then the partition of power
traces was essentially done randomly, and so one would expect the differential trace to 
be flat. On the other hand, if the guess for k’ is correct then the two averaged power 
traces will have some noticeable differences; one would expect the plot of the differ- 
ential trace to be flat with spikes in regions influenced by V. This process is repeated 
(using the same collection of power traces) until k’ is determined. 
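
The partition-and-average procedure can be demonstrated on synthetic data. The sketch below is a simulation under loudly stated assumptions, not an attack on any real device: the "device" is a made-up 8-bit lookup through a randomly shuffled table, each trace is Gaussian noise with a data-dependent offset at one sample, and the key portion k’ is a single byte. All names and parameters are hypothetical.

```python
import random

rng = random.Random(1)                       # fixed seed so the run is repeatable
SBOX = list(range(256))
rng.shuffle(SBOX)                            # hypothetical nonlinear step, not a real cipher


def selection(k_guess: int, m: int) -> int:
    """Selection function f(k', m): one bit of a key-dependent intermediate."""
    return SBOX[m ^ k_guess] & 1


def simulate_trace(secret: int, m: int, n_samples: int = 8, leak_at: int = 5):
    """Gaussian noise everywhere, plus a data-dependent offset at one sample."""
    trace = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    trace[leak_at] += 1.0 * selection(secret, m)
    return trace


def differential_trace(messages, traces, k_guess):
    """Average the traces in each predicted group and subtract the averages."""
    n = len(traces[0])
    sums = ([0.0] * n, [0.0] * n)
    counts = [0, 0]
    for m, trace in zip(messages, traces):
        v = selection(k_guess, m)
        counts[v] += 1
        group = sums[v]
        for j, sample in enumerate(trace):
            group[j] += sample
    return [sums[1][j] / counts[1] - sums[0][j] / counts[0] for j in range(n)]


def recover_key(messages, traces):
    """Return the guess whose differential trace has the largest spike."""
    def peak(k_guess):
        return max(abs(d) for d in differential_trace(messages, traces, k_guess))
    return max(range(256), key=peak)


secret = 0xA7
messages = [rng.randrange(256) for _ in range(1000)]
traces = [simulate_trace(secret, m) for m in messages]
recovered = recover_key(messages, traces)
```

The selection function is deliberately nonlinear. With a purely linear intermediate such as `m ^ k_guess`, many wrong guesses would produce the same or the exactly opposite partition, and the differential traces would not single out one key.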

These ideas are illustrated in the following DPA attack on the SPA-resistant 
point multiplication method of Algorithm 5.7. The attack demonstrates that SPA 
countermeasures do not necessarily resist DPA attacks. 

DPA attacks are generally not applicable to point multiplication in the signature 
generation procedure for elliptic curve signature schemes such as ECDSA (Algo- 
rithm 4.29) since the secret key k is different for each signature while the base point P 
is fixed. However, the attacks can be mounted on point multiplication in elliptic curve 
encryption and key agreement schemes. For example, for the point multiplication in 
the ECIES decryption procedure (Algorithm 4.43), the multiplier is k = hd where d is 
the long-term private key and h is the cofactor, and the base point is P = R where R is
the point included in the ciphertext. 

Suppose now that an adversary has collected the power traces as a cryptographic
device computed $kP_1, kP_2, \ldots, kP_r$ using Algorithm 5.7. The adversary knows
$P_1, P_2, \ldots, P_r$ and wishes to determine k. If $Q_0 = \infty$ then the doubling operation in
step 2.1 is trivial and therefore can likely be distinguished from a non-trivial doubling
operation by examination of a single power trace. Thus, the attacker can easily
determine the leftmost bit of k that is 1. Let us suppose that $k_{t-1} = 1$. The following
assignments are made in the first iteration of step 2 (with $i = t-1$): $Q_0 \leftarrow \infty$,
$Q_1 \leftarrow P$, $Q_0 \leftarrow P$. In the second iteration of step 2 (with $i = t-2$) the assignments are
$Q_0 \leftarrow 2P$, $Q_1 \leftarrow 3P$, and either $Q_0 \leftarrow 2P$ (if $k_{t-2} = 0$) or $Q_0 \leftarrow 3P$ (if $k_{t-2} = 1$).
It follows that the point 4P is computed in a subsequent iteration if and only if
$k_{t-2} = 0$. A position in the binary representation of a point is selected, and the power
traces are divided into two groups depending on whether the selected bit of $4P_i$ is 0
or 1. In the notation of the generic description of DPA attacks, the key portion is
$k' = k_{t-2}$, $m = P_i$, and the selection function f computes the selected bit of $4P_i$. If the
differential trace has some noticeable spikes, then the adversary concludes that
$k_{t-2} = 0$; otherwise $k_{t-2} = 1$. Once $k_{t-2}$ has been determined, the adversary can similarly
infer $k_{t-3}$ and so on.
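
The claim that 4P appears exactly when $k_{t-2} = 0$ can be checked by running Algorithm 5.7 symbolically. In the sketch below each point is represented by the integer n such that the point equals nP, so no curve arithmetic is needed; this representation and the function name are illustrative devices only.

```python
def doubled_multiples(k: int, t: int):
    """Multiples n such that nP is the result of step 2.1 in each iteration."""
    q0 = 0                           # Q_0 = infinity = 0 * P
    doubled = []
    for i in range(t - 1, -1, -1):
        q0 = 2 * q0                  # step 2.1
        q1 = q0 + 1                  # step 2.2
        doubled.append(q0)
        if (k >> i) & 1:             # step 2.3
            q0 = q1
    return doubled


t = 5
with_zero = doubled_multiples(0b10110, t)    # k_{t-1} = 1, k_{t-2} = 0
with_one = doubled_multiples(0b11010, t)     # k_{t-1} = 1, k_{t-2} = 1
```

The first list contains 4 and the second contains 6 in its place, so a selection function built on a bit of $4P_i$ is correlated with the traces only when the guess $k_{t-2} = 0$ is right. The same bookkeeping shows which multiple to target for each subsequent bit.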


Countermeasures As is the case with SPA attacks, numerous techniques for resisting 
DPA attacks have been proposed. Again, none of them are guaranteed to be sufficient 
and their effectiveness must be evaluated on a case-by-case basis. These countermea- 
sures are surveyed in the Notes section starting on page 254. Here we only present one 
countermeasure that provides resistance to the particular DPA attack described above 
for point multiplication. 

Suppose that the field $\mathbb{F}_q$ has characteristic > 3, and suppose that mixed Jacobian-
affine coordinates (see §3.2.2) are used in Algorithm 5.7. Thus, the point P is stored
in affine coordinates, while the points $Q_0$ and $Q_1$ are stored in Jacobian coordinates.
The first assignment of $Q_1$ is $Q_1 \leftarrow P$; if $P = (x, y)$ in affine coordinates, then
$Q_1 = (x : y : 1)$ in Jacobian coordinates. After this first assignment, the coordinates of
$Q_1$ are randomized to $(\lambda^2 x, \lambda^3 y, \lambda)$, where $\lambda$ is a randomly selected nonzero element
in $\mathbb{F}_q$, and the algorithm proceeds as before. The DPA attack described above is thwarted
because the adversary is unable to predict any specific bit of $4P_i$ (or other multiples of
$P_i$) in randomized Jacobian coordinates.
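
The randomization step relies on the fact that $(\lambda^2 x : \lambda^3 y : \lambda)$ and $(x : y : 1)$ represent the same point, since a Jacobian triple (X : Y : Z) corresponds to the affine point $(X/Z^2, Y/Z^3)$. The sketch below checks this for a prime field. The prime $2^{127} - 1$ and the values of x and y are placeholders chosen only to exercise the coordinate map; they are not claimed to lie on any particular curve. Python 3.8 or later is assumed for `pow(Z, -1, p)`.

```python
import secrets


def randomize_jacobian(x: int, y: int, p: int):
    """Map affine (x, y) to Jacobian (l^2 x, l^3 y, l) for a random nonzero l."""
    lam = 1 + secrets.randbelow(p - 1)           # uniform in [1, p-1]
    return (lam * lam * x % p, pow(lam, 3, p) * y % p, lam)


def to_affine(X: int, Y: int, Z: int, p: int):
    """Jacobian (X : Y : Z) with Z != 0 corresponds to (X / Z^2, Y / Z^3)."""
    z_inv = pow(Z, -1, p)
    return (X * z_inv * z_inv % p, Y * pow(z_inv, 3, p) % p)


p = 2**127 - 1                                   # placeholder prime modulus
x, y = 1234567890123456789, 9876543210987654321  # placeholder field elements
first = randomize_jacobian(x, y, p)
second = randomize_jacobian(x, y, p)
```

Two runs almost surely give different triples for the same point, which is what prevents the adversary from predicting a specific bit of an intermediate value, while `to_affine` returns (x, y) for both.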


5.3.2 Electromagnetic analysis attacks 


The flow of current through a CMOS device also induces electromagnetic (EM) emana- 
tions. The EM signals can be collected by placing a sensor close to the device. As with 
power analysis attacks, one can now analyze the EM signals in the hope that they reveal 
information about the instructions being executed and contents of data registers. Simple 
ElectroMagnetic Analysis (SEMA) attacks and Differential ElectroMagnetic Analysis 
(DEMA) attacks, analogues of SPA and DPA attacks, can be launched. As with power 
analysis attacks, these electromagnetic analysis (EMA) attacks are non-intrusive and 
can be performed with relatively inexpensive equipment. 

Since EM emanations may depend on the physical characteristics of the active gates, 
a single EM sensor captures multiple EM signals of different types. These signals can 
be separated and analyzed individually. This is unlike the case of power analysis attacks 
where the power consumption measured is the single aggregation of power consumed 
by all active units. Consequently, EMA attacks can potentially reveal more information 
than power analysis attacks, and therefore constitute a more significant threat. 

The most comprehensive study on EMA attacks was undertaken in 2002 by IBM 
researchers Agrawal, Archambeault, Rao and Rohatgi, who conducted experiments on 
several smart cards and a server containing an SSL accelerator. Their experiments pro- 
vide convincing evidence that the output of a single wideband EM sensor consists of 
multiple EM signals, each of which can encode somewhat different information about 
the device’s state. Moreover, they succeeded in using EMA attacks to compromise the 
security of some commercially available cryptographic devices that had built-in coun- 
termeasures for resisting power analysis attacks, thus demonstrating that EMA attacks 
can indeed be more powerful than power analysis attacks. 

As with power analysis, EMA countermeasures could be hardware based (e.g., metal 
layers to contain the EM emanations or circuit redesign to reduce the EM emanations) 
or software based (e.g., use of randomization). The study of EMA attacks is relatively 
new, and it remains to be seen which countermeasures prove to be the most effective. 


5.3.3 Error message analysis 


Another side channel that may be available to an adversary is the list of error messages 
generated by the victim’s cryptographic device. Consider, for example, the decryption
process of a public-key encryption scheme such as ECIES (see §4.5.1). A ciphertext
might be rejected as invalid because some data item encountered during decryption
is not of requisite form. In the case of ECIES decryption (Algorithm 4.43), a
ciphertext (R, C, t) will be rejected if embedded public key validation of R fails, or
if $Z = hdR = \infty$, or if the authentication tag t is invalid. There are several ways in
which the adversary may learn the reason for rejection. For example, the error message 
may be released by the protocol that used the encryption scheme, the adversary may 
be able to access the error log file, or the adversary may be able to accurately time 
the decryption process thereby learning the precise point of failure. An adversary who 
learns the reason for rejection may be able to use this information to its advantage. 

To illustrate this kind of side-channel attack, we consider Manger’s attack on the 
RSA-OAEP encryption scheme. Manger’s attack is very effective, despite the fact that 
RSA-OAEP has been proven secure (in the random oracle model). This supports the 
contention that a cryptographic scheme that is secure in a traditional security model is 
not necessarily secure when deployed in a real-world setting. 


RSA-OAEP encryption scheme 


RSA-OAEP is intended for the secure transport of short messages such as symmetric 
session keys. It first formats the plaintext message using Optimal Asymmetric Encryp- 
tion Padding (OAEP), and then encrypts the formatted message using the basic RSA 
function. RSA-OAEP has been proven secure (in the sense of Definition 4.41) under 
the assumption that the problem of finding eth roots modulo n is intractable, and that 
the hash functions employed are random functions. The following notation is used in 
the descriptions of the encryption and decryption procedures. 


1. A’s RSA public key is (n, e), and d is A’s corresponding private key. The integer
n is k bytes in length. For example, if n is a 1024-bit modulus, then k = 128.


2. H is a hash function with $l$-byte outputs. For example, H may be SHA-1, in
which case $l = 20$.


3. P consists of some encoding parameters.


4. padding consists of a string of 00 bytes (possibly empty) followed by a 01 byte.


5. G is a mask generating function. It takes as input a byte string s and an output
length t, and generates a (pseudorandom) byte string of length t bytes. In practice,
G(s, t) may be defined by concatenating successive hash values $H(s \,\|\, i)$,
for $0 \le i \le \lceil t/l \rceil - 1$, and deleting any rightmost bytes if necessary.


Algorithm 5.8 RSA-OAEP encryption


INPUT: RSA public key (n, e), message M of length at most $k - 2 - 2l$ bytes.
OUTPUT: Ciphertext c.


1. Select a random seed S of length $l$ bytes.


2. Apply the OAEP encoding operation, depicted in Figure 5.15, with inputs S, P
and M to obtain an integer m:

2.1 Form the padded message PM of length $k - l - 1$ bytes by concatenating
H(P), a padding string of the appropriate length, and M.

2.2 Compute $\text{maskedPM} = \text{PM} \oplus G(S, k - l - 1)$.

2.3 Compute $\text{maskedS} = S \oplus G(\text{maskedPM}, l)$.

2.4 Concatenate the strings maskedS and maskedPM and convert the result $\overline{m}$
to an integer m.


3. Compute $c = m^e \bmod n$.


The concatenation $\overline{m}$ of maskedS and maskedPM is a byte string of length $k - 1$.
This ensures that the integer representation m of $\overline{m}$ is less than the modulus n, which is
k bytes in length, and hence m can be recovered from c.


Figure 5.15. OAEP encoding function.


Algorithm 5.9 RSA-OAEP decryption
INPUT: RSA public key (n, e), private key d, ciphertext c.
OUTPUT: Plaintext M or rejection of the ciphertext.

1. Check that $c \in [0, n - 1]$; if not then return(“Reject the ciphertext”).

2. Compute $m = c^d \bmod n$.

3. Convert m to a byte string $\overline{m}$ of length k. Let X denote the first byte of $\overline{m}$.

4. If $X \ne 00$ then return(“Reject the ciphertext”).

5. Apply the OAEP decoding operation, depicted in Figure 5.16, with inputs P, $\overline{m}$:

5.1 Parse $\overline{m}$ to obtain X, a byte string maskedS of length $l$, and a byte string
maskedPM of length $k - l - 1$.

5.2 Compute $S = \text{maskedS} \oplus G(\text{maskedPM}, l)$.

5.3 Compute $\text{PM} = \text{maskedPM} \oplus G(S, k - l - 1)$.

5.4 Separate PM into a byte string Q consisting of the first $l$ bytes of PM, a
(possibly empty) byte string PS consisting of all consecutive zero bytes
following Q, a byte T, and a byte string M.

5.5 If $T \ne 01$ then return(“Reject the ciphertext”).

5.6 If $Q \ne H(P)$ then return(“Reject the ciphertext”).


6. Return(M).


Figure 5.16. OAEP decoding function.


Manger’s attack and countermeasures 


A ciphertext $c' \in [0, n - 1]$ may be invalid for several reasons: either $X \ne 00$ in step 4
of Algorithm 5.9, or $T \ne 01$ in step 5.5, or $Q \ne H(P)$ in step 5.6. Manger’s attack
assumes that an adversary is able to ascertain whether $X \ne 00$ in the case that $c'$ is found
to be invalid by the decryptor. The attack does not require the full power of a chosen-
ciphertext attack: the adversary does not need to learn the plaintexts corresponding to
ciphertexts of her choosing.

Suppose now that the adversary wishes to decrypt a target ciphertext c that was
encrypted using A’s RSA key. Since c is valid, the adversary knows a priori that
$m = c^d \bmod n$ lies in the interval $I = [0, 2^{8(k-1)} - 1]$. The adversary selects
ciphertexts $c'$ related to c in such a way that knowledge of whether the leftmost byte $X'$ of
$m' = (c')^d \bmod n$ satisfies $X' \ne 00$ allows her to decrease the length of the interval $I$
known to contain m by a factor (roughly) of 2. We will not present the technical details of
how $c'$ is chosen but only mention that this can be done very efficiently. After presenting
about 8k such ciphertexts $c'$ to A and learning whether the corresponding $X'$ satisfy
$X' \ne 00$, the interval $I$ will have only one integer in it, and the adversary will thereby
have recovered m and can easily compute the plaintext M. If n is a 1024-bit integer, then
only about 1024 interactions are required with the victim and hence the attack should be
viewed as being quite practical.
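
Although the full procedure is not given here, its starting point can be sketched. As the attack is commonly described, the related ciphertexts have the form $c' = f^e c \bmod n$, which decrypt to $m' = f m \bmod n$ because RSA is multiplicative. The Python sketch below is illustrative only: the key is a toy built from two Mersenne primes, the function x_is_nonzero is a hypothetical stand-in for whatever error message or timing difference leaks the outcome of step 4, and only the first bracketing phase (doubling f) is shown, under the assumptions that $2 \cdot 2^{8(k-1)} < n$ and $m \ne 0$.

```python
# Toy RSA key from two Mersenne primes; for illustration only, never for real use.
p, q = 2**89 - 1, 2**107 - 1
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))       # requires Python 3.8+
k = (n.bit_length() + 7) // 8           # byte length of n
B = 1 << (8 * (k - 1))                  # X = 00 exactly when m < B

queries = 0


def x_is_nonzero(c_prime):
    """Hypothetical leak: True iff the first byte X' of m' = c'^d mod n is nonzero."""
    global queries
    queries += 1
    return pow(c_prime, d, n) >= B


def blind(c, f):
    """Return c' = f^e * c mod n, which decrypts to m' = f * m mod n."""
    return pow(f, e, n) * c % n


def first_bracket(c):
    """Double f until f*m >= B is reported; then B/f <= m < 2B/f."""
    f = 2
    while not x_is_nonzero(blind(c, f)):
        f *= 2
    return B // f, 2 * B // f


m_secret = 3**100                        # stands in for an OAEP-formatted m < B
c = pow(m_secret, e, n)
lo, hi = first_bracket(c)                # attacker now knows lo <= m < hi
```

After this phase the unknown m is confined to an interval whose endpoints differ by a factor of 2; the later phases, omitted here, shrink that interval one query at a time.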

The attack can be prevented by ensuring that the decryption process returns identical 
error messages if any of the three checks fail. Moreover, to prevent the possibility of 
an adversary deducing the point of error by timing the decryption operation, the checks 
in steps 4 and 5.5 of Algorithm 5.9 should be deferred until H(P) has been computed 
and is being compared with Q in step 5.6. 
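
A minimal Python sketch of Algorithms 5.8 and 5.9 with this countermeasure applied is given below. Several details are assumptions made for illustration: H is SHA-1 as in the example above, the counter i in G is encoded as 4 bytes big-endian, the key is a toy built from two Mersenne primes, and all function names are hypothetical. Python offers no constant-time guarantees, so the sketch shows only the control-flow structure (all checks evaluated after H(P) is computed, one indistinguishable rejection), not a hardened implementation.

```python
import hashlib
import hmac
import secrets

L = 20  # l: output length of H in bytes (SHA-1)


def H(data):
    return hashlib.sha1(data).digest()


def G(s, t):
    """Mask generation: H(s || i) for i = 0 .. ceil(t/l) - 1, truncated to t bytes.
    Assumption: i is encoded as a 4-byte big-endian counter."""
    blocks = -(-t // L)
    return b"".join(H(s + i.to_bytes(4, "big")) for i in range(blocks))[:t]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def rsa_oaep_encrypt(M, P, n, e):
    k = (n.bit_length() + 7) // 8
    if len(M) > k - 2 - 2 * L:
        raise ValueError("message too long")
    S = secrets.token_bytes(L)                               # step 1
    padding = b"\x00" * (k - 2 - 2 * L - len(M)) + b"\x01"
    PM = H(P) + padding + M                                  # step 2.1
    masked_pm = xor(PM, G(S, k - L - 1))                     # step 2.2
    masked_s = xor(S, G(masked_pm, L))                       # step 2.3
    m = int.from_bytes(masked_s + masked_pm, "big")          # step 2.4
    return pow(m, e, n)                                      # step 3


def rsa_oaep_decrypt(c, P, n, d):
    """Return M, or None for every kind of rejection."""
    k = (n.bit_length() + 7) // 8
    if not 0 <= c < n:
        return None
    m_bar = pow(c, d, n).to_bytes(k, "big")
    X, masked_s, masked_pm = m_bar[0], m_bar[1:1 + L], m_bar[1 + L:]
    S = xor(masked_s, G(masked_pm, L))
    PM = xor(masked_pm, G(S, k - L - 1))
    Q, rest = PM[:L], PM[L:]
    body = rest.lstrip(b"\x00")           # drop PS; body is T || M if well formed
    # Checks of steps 4, 5.5 and 5.6 are all evaluated before any decision.
    x_ok = X == 0
    t_ok = body[:1] == b"\x01"
    q_ok = hmac.compare_digest(Q, H(P))
    return body[1:] if (x_ok & t_ok & q_ok) else None


# Toy key from two Mersenne primes; for illustration only, never for real use.
p, q = 2**521 - 1, 2**607 - 1
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))         # requires Python 3.8+
c = rsa_oaep_encrypt(b"session key", b"", n, e)
```

A caller sees only M or None; whether X, T or Q caused the rejection is not reported, and no check short-circuits the others.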


5.3.4 Fault analysis attacks 


Boneh, DeMillo and Lipton observed that if an error occurs while a cryptographic 
device is performing a private-key operation, then the output of the cryptographic op- 
eration may be incorrect and thereby provide exploitable information to an adversary. 
Such errors may be introduced by non-malicious agents (e.g., hardware failures, soft- 
ware bugs, or external noise) or may be induced by a malicious adversary who has 
physical access to the device. 

Fault analysis attacks generally do not pose a significant threat in practice. However, 
if the environment in which cryptographic operations are being performed is conducive 
to either non-malicious or induced errors, then suitable precautions should be taken. 
These include verifying the result of a computation before exposing it, and using error- 
control techniques to detect or correct data errors in internal memory. 

We illustrate the basic ideas by presenting fault analysis attacks and countermeasures 
on the RSA signature scheme. 


RSA signature generation 


Consider the FDH (Full Domain Hash) variant of the RSA signature scheme with
public key (n, e) and private key d. The signature of a message M is


$$s = m^d \bmod n, \tag{5.7}$$


where m = H(M) and H is a hash function whose outputs are integers in the interval
$[0, n - 1]$. The signature s on M is verified by computing m = H(M) and $m' = s^e \bmod n$,
and then checking that $m = m'$.

In order to accelerate the signing operation (5.7), the signer computes


$$s_p = m^{d_p} \bmod p \quad\text{and}\quad s_q = m^{d_q} \bmod q, \tag{5.8}$$


where p and q are the prime factors of n, $d_p = d \bmod (p - 1)$, and $d_q = d \bmod (q - 1)$.
Then the signature s can be computed as


$$s = a s_p + b s_q \bmod n,$$


where a and b are integers satisfying


$$a \equiv \begin{cases} 1 \pmod{p} \\ 0 \pmod{q} \end{cases} \quad\text{and}\quad b \equiv \begin{cases} 0 \pmod{p} \\ 1 \pmod{q}. \end{cases}$$


The integers $d_p$, $d_q$, a and b can be precomputed by the signer. This signing procedure
is faster because the two modular exponentiations in (5.8) have exponents and moduli
that are half the bitlengths of the exponent and modulus in (5.7).

Suppose now that an error occurs during the computation of $s_p$ and that no errors
occur during the computation of $s_q$. In particular, suppose that $s_p \not\equiv m^{d_p} \pmod{p}$ and
$s_q \equiv m^{d_q} \pmod{q}$. Thus


$$s \not\equiv m^{d_p} \pmod{p} \quad\text{and}\quad s \equiv m^{d_q} \pmod{q},$$


whence


$$s^e \not\equiv m \pmod{p} \quad\text{and}\quad s^e \equiv m \pmod{q}.$$


It follows that


$$\gcd(s^e - m, n) = q, \tag{5.9}$$


and so an adversary who obtains the message representative m and the (incorrect) 
signature s can easily factor n and thereafter compute the private key d. 
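
The factoring step can be checked numerically with a short Python sketch. Everything specific in it is an assumption for illustration: the key is a toy built from two Mersenne primes, the fault is modeled by flipping the low bit of $s_p$, and fdh is only a stand-in for a full-domain hash (SHA-256 reduced modulo n). The function sign_checked adds the precaution, mentioned earlier, of verifying a result before exposing it.

```python
import hashlib
from math import gcd

# Toy key from two Mersenne primes; for illustration only, never for real use.
p, q = 2**89 - 1, 2**107 - 1
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))       # requires Python 3.8+
dp, dq = d % (p - 1), d % (q - 1)
a = q * pow(q, -1, p)                   # a = 1 (mod p), a = 0 (mod q)
b = p * pow(p, -1, q)                   # b = 0 (mod p), b = 1 (mod q)


def fdh(M):
    """Stand-in for a full-domain hash: SHA-256 reduced modulo n."""
    return int.from_bytes(hashlib.sha256(M).digest(), "big") % n


def sign_crt(m, fault=False):
    """CRT signing as in (5.8); fault=True models an error in s_p only."""
    sp = pow(m, dp, p)
    sq = pow(m, dq, q)
    if fault:
        sp ^= 1                         # modeled fault: flip the low bit of s_p
    return (a * sp + b * sq) % n


def sign_checked(m, fault=False):
    """Verify the signature before releasing it."""
    s = sign_crt(m, fault)
    if pow(s, e, n) != m:
        raise RuntimeError("signature failed self-check; not released")
    return s


m = fdh(b"message to be signed")
s_bad = sign_crt(m, fault=True)
recovered = gcd((pow(s_bad, e, n) - m) % n, n)   # equation (5.9)
```

Here recovered equals the prime factor q, so a single faulty signature on a known m suffices; with sign_checked the faulty value never leaves the device.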

One method for resisting this particular fault analysis attack on RSA signatures is to 
incorporate some randomness in the formation of the message representative m from 
the message M in such a way that an adversary cannot learn m from an erroneous 
signature (and thus cannot evaluate the gcd in (5.9)). This property holds in the PSS 
(Probabilistic Signature Scheme) variant of the RSA signature scheme. Note, however, 
that there may exist other kinds of fault analysis attacks that are effective on PSS. 

The simplest and most effective countermeasure is to insist that the device verify the
signature before transmission.


5.3.5 Timing attacks


The premise behind timing attacks is that the amount of time to execute an arithmetic 
operation can vary depending on the value of its operands. An adversary who is capa- 
ble of accurately measuring the time a device takes to execute cryptographic operations 
(e.g., signature generation on a smart card) can analyze the measurements obtained to 
deduce information about the secret key. Timing attacks are generally not as serious 
a threat as power analysis attacks to devices such as smart cards because they typi- 
cally require a very large number of measurements. However, recent work by Boneh 
and Brumley has shown that timing attacks can be a concern even when launched 
against a workstation running a protocol such as SSL with RSA over a local network 
(where power analysis attacks may not be applicable). Thus, it is prudent that security 
engineers consider resistance of their systems to timing attacks. 
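
The premise can be illustrated with a deliberately simplified Python sketch. It uses a left-to-right binary exponentiation and counts the conditional multiplications as a crude proxy for running time; this is an assumption made for illustration, not a description of the published attacks, which exploit finer operand-dependent variations and statistical analysis over many measurements.

```python
def modexp_count(base, exponent, modulus):
    """Left-to-right square-and-multiply; also returns the number of multiplies."""
    result, mults = 1, 0
    for bit in bin(exponent)[2:]:
        result = result * result % modulus      # a squaring for every bit
        if bit == "1":
            result = result * base % modulus    # a multiply only for 1 bits
            mults += 1
    return result, mults


# Two 7-bit exponents: same number of squarings, different number of multiplies.
sparse = modexp_count(5, 0b1000001, 1009)
dense = modexp_count(5, 0b1111111, 1009)
```

Both calls perform seven squarings, but the number of multiplications equals the Hamming weight of the exponent, so the total work depends on the secret.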

While experimental results on timing attacks on RSA and DES implementations 
have been reported in the literature, there have not been any published reports on timing 
attacks on implementations of elliptic curve systems. The attacks are expected to be 
especially difficult to mount on elliptic curve signature schemes such as ECDSA since 
a fresh per-message secret k is chosen each time the signature generation procedure is 
invoked. 


5.4 Notes and further references 


§5.1 

The features of the Intel IA-32 family of processors are described in [210]. References 
for optimization techniques for the Pentium family of processors include the Intel man- 
uals [208, 209] and Gerber [171]. SIMD capabilities of the AMD K6 processor are 
detailed in [4]. Footnote 1 on instruction latency and throughput is from Intel [209].


The SPARC specification is created by the Architecture Committee of SPARC International
(http://www.sparc.org), and is documented in Weaver and Germond [476]; see
also Paul [371]. The V9 design was preceded by the Texas Instruments and Sun Super- 
SPARC and the Ross Technology HyperSPARC, both superscalar. Examples 5.1 and 
5.2 are based in part on GNU MP version 4.1.2. 


The fast implementations for finite field and elliptic curve arithmetic in P-224 using 
floating-point operations described in §5.1.2 are due to Bernstein [42, 43]. Historical 
information and references are provided in [42]. Required numerical analyses of the 
proposed methods for P-224 were not complete as of 2002. Bernstein has announced 
that “Fast point multiplication on the NIST P-224 elliptic curve” is expected to be 
included in his forthcoming book on High-speed Cryptography. 


Although SIMD is often associated with image and speech applications, Intel [209] 
also suggests the use of such capabilities in “encryption algorithms.” Aoki and Lipmaa
[17] evaluated the effectiveness of MMX-techniques on the AES finalists, noting that
MMX was particularly effective for Rijndael; see also Lipmaa’s [298] implementation
of the IDEA block cipher. In cross-platform code distributed for solving the Certi- 
com ECC2K-108 Challenge [88] (an instance of the elliptic curve discrete logarithm 
problem for a Koblitz curve over a 109-bit binary field), Robert Harley [191] provided 
several versions of field multiplication routines. The MMX version was “about twice as 
fast” as the version using only general-purpose registers. The Karatsuba-style approach 
worked well for the intended target; however, the fastest versions of Algorithm 2.36 
using only general-purpose registers were competitive in our tests. 


Integer multiplication in Example 5.2 uses only scalar operations in the SSE2 in- 
struction set. Moore [332] exploits vector capabilities of the 128-bit SSE2 registers 
to perform two products simultaneously from 32-bit values in each 64-bit half of the 
register. The method is roughly operand scanning, obtaining the matrix $(a_i b_j)$ of
products of 29-bit values $a_i$ and $b_j$ in submatrices of size 4 × 4 (corresponding to values in
a pair of 128-bit registers). A shuffle instruction (pshufd) is used extensively to load a 
register with four 32-bit components selected from a given register. Products are accu- 
mulated, but “carry processing” is handled in a second stage. The supplied code adapts 
easily to inputs of fairly general size; however, for the specific case discussed in Exam- 
ple 5.2, the method was not as fast as a (fixed size) product-scanning approach using 
scalar operations. 


Of recent works that include implementation details and timings on common general- 
purpose processors, the pair of papers by Lim and Hwang [293, 294] are noted for 
the extensive benchmark data (on the Intel Pentium II and DEC Alpha), especially 
for OEFs. Smart [440] compares representative prime, binary, and optimal extension 
fields of approximately the same size, in the context of elliptic curve methods. Timings
on a Sun UltraSPARC IIi and an Intel Pentium Pro are provided for field and
elliptic curve operations. Coding is in C++ with limited in-line assembly; a Karatsuba- 
Ofman method with lookup tables for multiplication of polynomials of degree less 
than 8 is used for the binary field. Hankerson, López, and Menezes [189] and Brown,
Hankerson, López, and Menezes [77] present an extensive study of software
implementation for the NIST curves, with field and curve timings on an Intel Pentium II.
De Win, Mister, Preneel, and Wiener [111] compare ECDSA to DSA and RSA sig- 
nature algorithms. Limited assembly on an Intel Pentium Pro was used for the prime 
field; reduction is via Barrett. The binary field arithmetic follows Schroeppel, Orman,
O’Malley, and Spatscheck [415]; in particular, the almost inverse algorithm (Algorithm
2.50) is timed for two reduction trinomials, one of which is favourable to the almost 
inverse method. 


Implementors for constrained devices such as smartcards and handhelds face a dif- 
ferent set of challenges and objectives. An introductory survey of smartcards with 
cryptographic capabilities circa 1995 is given by Naccache and M’Raihi [339]. Durand
[126] compares inversion algorithms for prime characteristic fields, and provides
timings for RSA decryption and elliptic curve point multiplication on RISC processors
from SGS-Thomson. Hasegawa, Nakajima, and Matsui [194] implement ECDSA on a 16-
bit CISC M16C processor from Mitsubishi. Low memory consumption was paramount, 
and elliptic curve point operations are written to use only two temporary variables. 
The ECDSA implementation including SHA-1 required 4000 bytes. A prime of the 
form $p = e2^a \pm 1$ is proposed for efficiency, where e fits within a word (16 bits in
this case), and a is a multiple of the word size; in particular, $p = 65112 \cdot 2^{144} - 1$
of 160 bits is used for the implementation. Itoh, Takenaka, Torii, Temma, and Kurihara
[216] implement RSA, DSA, and ECDSA on the Texas Instruments digital signal
processor TMS320C620. Pipelining improvements are proposed for a Montgomery 
multiplication algorithm discussed in [260]. A consecutive doubling algorithm reduces 
the number of field multiplications (with a method related to the modified Jacobian 
coordinates in Cohen, Miyaji, and Ono [100]); field additions are also reduced under 
the assumption that division by 2 has cost comparable to field addition (see §3.2.2). 
Guajardo, Blümel, Krieger, and Paar [182] target low-power and low-cost devices
based on the Texas Instruments MSP430x33x family of 16-bit RISC microcontrollers.
Implementation is over $\mathbb{F}_p$ for prime $p = 2^{128} - 2^{97} - 1$, suitable for lower-security
applications. Inversion is based on Fermat’s theorem, and the special form of the modulus
is used to reduce the amount of precomputation in a k-ary exponentiation method. 


OEFs have been attractive for some constrained devices. Chung, Sim, and Lee [97] 
discuss performance and implementation considerations for a low-power Samsung 
CalmRISC 8-bit processor with a MAC2424 math coprocessor. The coprocessor op- 
erates in 24-bit or 16-bit mode; the 16-bit mode was selected due to performance 
restrictions. Timings are provided for field and curve operations over $\mathbb{F}_{p^{10}}$ with
$p = 2^{16} - 165$ and reduction polynomial $f(z) = z^{10} - 2$. Woodbury, Bailey, and Paar [486]
examine point multiplication on very low-cost Intel 8051 family processors. Only 256
bytes of RAM are available, along with slower external XRAM used for precomputation.
Implementation is for a curve over the OEF $\mathbb{F}_{(2^8 - 17)^{17}}$ with reduction polynomial
$f(z) = z^{17} - 2$, suitable for lower-security applications.


Personal Digital Assistants such as the Palm and RIM offerings have substantial mem- 
ory and processing capability compared with the constrained devices noted above, but 
are less powerful than common portable computers and have power and communication 
bandwidth constraints. Weimerskirch, Paar, and Chang Shantz [477] present implemen- 
tation results for the Handspring Visor with 2 MB of memory and a 16 MHz Motorola 
Dragonball running the Palm OS. Timings are provided for the NIST recommended 
random and Koblitz curves over $\mathbb{F}_{2^{163}}$.


§5.2 
The elliptic curve processor architecture depicted in Figure 5.3 is due to Orlando and 
Paar [361]. 


Beth and Gollmann [45] describe several circuits for $\mathbb{F}_{2^m}$ multipliers including the
MSB and LSB versions, and ones that use normal and dual basis representations. The
digit-serial multiplier (Algorithm 5.5) was proposed by Song and Parhi [449]. Algorithm
5.6 for inversion in $\mathbb{F}_{2^m}$ is due to Itoh and Tsujii [217] (see also Agnew, Beth,
Mullin and Vanstone [5]). The algorithm is presented in the context of a normal basis
representation for the elements of $\mathbb{F}_{2^m}$. Guajardo and Paar [183] adapted the algorithm
for inversion in general extension fields (including optimal extension fields) that use a 
polynomial basis representation. 


There are many papers that describe hardware implementations of elliptic curve opera- 
tions. The majority of these papers consider elliptic curves over binary fields. Orlando 
and Paar [361] proposed a scalable processor architecture suitable for the FPGA imple- 
mentation of elliptic curve operations over binary fields. Multiplication is performed 
with the digit-serial circuit proposed by Song and Parhi [449]. Timings are provided 
for the field $\mathbb{F}_{2^{167}}$. Okada, Torii, Itoh and Takenaka [353] describe an FPGA
implementation for elliptic curves over $\mathbb{F}_{2^{163}}$. Bednara et al. [32] (see also Bednara et al.
[33]) compared their FPGA implementations of elliptic curve operations over the field
$\mathbb{F}_{2^{191}}$ with polynomial and normal basis representations. They concluded that a
polynomial basis multiplier will require fewer logic gates to implement than a normal
basis multiplier, and that Montgomery’s method (Algorithm 3.40) is preferred for point 
multiplication. 


The hardware design of Ernst, Jung, Madlener, Huss and Blümel [134] uses the
Karatsuba-Ofman method for multiplying binary polynomials. Hardware designs
intended to minimize power consumption were considered by Goodman and Chandrakasan
[177], and by Schroeppel, Beaver, Gonzales, Miller and Draelos [414]. Gura
et al. [186] designed hardware accelerators that permit any elliptic curve over any
binary field $\mathbb{F}_{2^m}$ with m < 255. Architectures that exploit subfields of a binary field were
studied by Paar and Soria-Rodriguez [365]. 


Hardware implementations of binary field arithmetic that use a normal basis representation
are described by Agnew, Mullin, Onyszchuk and Vanstone [6] (for the field
$\mathbb{F}_{2^{593}}$), Agnew, Mullin and Vanstone [7] (for the field $\mathbb{F}_{2^{155}}$), Gao, Shrivastava and
Sobelman [162] (for arbitrary binary fields), and Leong and Leung [286] (for the fields
$\mathbb{F}_{2^{113}}$, $\mathbb{F}_{2^{155}}$ and $\mathbb{F}_{2^{173}}$). The latter two papers include both the finite field operations and
the elliptic curve operations.





Koren’s book [266] is an excellent introduction to hardware architectures for perform- 
ing the basic integer operations of addition, subtraction and multiplication. Orlando 
and Paar [362] detail a scalable hardware architecture for performing elliptic curve 
arithmetic over prime fields. 


Savaş, Tenca and Koç [404] and Großschädl [181] introduced scalable multipliers for
performing multiplication in both prime fields and binary fields. For both designs, the
unified multipliers require only slightly more area than for a multiplier solely for prime
fields. Multiplication in the Savaş, Tenca and Koç design is performed using Montgomery’s
technique (cf. §2.2.4), while Großschädl’s design uses the more conventional
approach of accumulating partial products. Unified designs for Montgomery inversion
in both prime fields and binary fields were studied by Gutub, Tenca, Savaş and Koç
[187]. An architecture with low power consumption for performing all operations in
both binary fields and prime fields was presented by Wolkerstorfer [485]. 


Bertoni et al. [44] present hardware architectures for performing multiplication in $\mathbb{F}_{p^m}$
where p is odd, with an emphasis on the case p = 3; see also Page and Smart [366]. 


§5.3 

Much of the research being conducted on side-channel attacks and their countermeasures
is presented at the conferences on “Cryptographic Hardware and Embedded
Systems” that have been held annually since 1999. The proceedings of these conferences
are published by Springer-Verlag [262, 263, 261, 238]. Side-channel attacks do
not include exploitation of common programming and operational errors such as buffer 
overflows, predictable random number generators, race conditions, and poor password 
selection. For a discussion of the security implications of such errors, see the books by 
Anderson [11] and Viega and McGraw [473]. 


SPA and DPA attacks were introduced in 1998 by Kocher, Jaffe and Jun [265]. Coron 
[104] was the first to apply these attacks to elliptic curve cryptographic schemes, and 
proposed the SPA-resistant method for point multiplication (Algorithm 5.7), and the 
DPA-resistant method of randomizing projective coordinates. Oswald [364] showed 
how a multiplier k can be determined using the partial information gained about 
NAF(k) from a power trace of an execution of the binary NAF point multiplication 
method (Algorithm 3.31). Experimental results with power analysis attacks on smart 
cards were reported by Akkar, Bevan, Dischamp and Moyart [9] and Messerges, Dab- 
bish and Sloan [323], while those on a DSP processor core are reported by Gebotys 
and Gebotys [168]. Figures 5.13 and 5.14 are taken from Gebotys and Gebotys [168]. 


Chari, Jutla, Rao and Rohatgi [91] presented some general SPA and DPA coun- 
termeasures, and a formal methodology for evaluating their effectiveness. Proposals 
for hardware-based defenses against power analysis attacks include using an internal 
power source, randomizing the order in which instructions are executed (May, Muller 
and Smart [308]), randomized register renaming (May, Muller and Smart [309]), and 
using two capacitors, one of which is charged by an external power supply and the 
other supplies power to the device (Shamir [422]). 


One effective method for guarding against SPA attacks on point multiplication is to 
employ elliptic curve addition formulas that can also be used for doubling. This ap- 
proach was studied by Liardet and Smart [291] for curves in Jacobi form, by Joye and 
Quisquater [231] for curves in Hessian form, and by Brier and Joye [74] for curves 
in general Weierstrass form. Izu and Takagi [221] devised an active attack (not using 
power analysis) on the Brier-Joye formula that can reveal a few bits of the private key 
in elliptic curve schemes that use point multiplication with a fixed multiplier. Another 
strategy for SPA resistance is to use point multiplication algorithms such as Coron’s
(Algorithm 5.7) where the pattern of addition and double operations is independent
of the multiplier. Other examples are Montgomery point multiplication (see page 102
and also Okeya and Sakurai [358]), and the methods presented by Möller [327, 328],
Hitchcock and Montague [198], and Izu and Takagi [220]. The security and efficiency
of (improved versions of) the Möller [327] and Izu-Takagi [220] methods were carefully
analyzed by Izu, Möller and Takagi [219]. Another approach taken by Trichina
and Bellezza [461] and Gebotys and Gebotys [168] is to devise formulas for the ad- 
dition and double operations that have the same pattern of field operations (addition, 
subtraction, multiplication and squaring). 


Hasan [193] studied power analysis attacks on point multiplication for Koblitz curves 
(see §3.4) and proposed some countermeasures which do not significantly degrade 
performance. 


Joye and Tymen [232] proposed using a randomly chosen elliptic curve isomorphic to 
the given one, and a randomly chosen representation for the underlying fields, as coun- 
termeasures to DPA attacks. Goubin [180] showed that even if point multiplication is 
protected with an SPA-resistant method such as Algorithm 5.7 and a DPA-resistant 
method such as randomized projective coordinates, randomized elliptic curve, or ran- 
domized field representation, the point multiplication may still be vulnerable to a DPA 
attack in situations where an attacker can select the base point (as is the case, for ex- 
ample, with ECIES). Goubin’s observations highlight the difficulty in securing point 
multiplication against power analysis attacks. 


The potential of exploiting electromagnetic emanations has been known in military cir- 
cles for a long time. For example, see the recently declassified TEMPEST document 
written by the National Security Agency [343] that investigates different compromising 
emanations including electromagnetic radiation, line conduction, and acoustic emis- 
sions. The unclassified literature on attack techniques and countermeasures is also 
extensive. For example, Kuhn and Anderson [272] discuss software-based techniques 
for launching and preventing attacks based on deducing the information on video 
screens from the electromagnetic radiations emitted. Loughry and Umphress [302] de- 
scribe how optical radiation emitted from computer LED (light-emitting diodes) status 
indicators can be analyzed to infer the data being processed by a device. Chapter 15 of 
Anderson’s book [11] provides an excellent introduction to emission security. Exper- 
imental results on electromagnetic analysis (EMA) attacks on cryptographic devices 
such as smart cards and comparisons to power analysis attacks were first presented 
by Quisquater and Samyde [386] and Gandolfi, Mourtel and Olivier [161]. The most 
comprehensive unclassified study on EMA attacks to date is the work of Agrawal, 
Archambeault, Rao and Rohatgi [8]. 


The first prominent example of side-channel attacks exploiting error messages was 
Bleichenbacher’s 1998 attack [53] on the RSA encryption scheme as specified in 
the PKCS#1 v1.5 standard [394]. This version of RSA encryption, which specifies a 
method for formatting the plaintext message prior to application of the RSA function,
is widely deployed in practice including in the SSL protocol for secure web communications.
For 1024-bit RSA moduli, Bleichenbacher’s attack enables an adversary to
obtain the decryption of a target ciphertext c by submitting about one million carefully- 
chosen ciphertexts related to c to the victim and learning whether the ciphertexts were 
rejected or not. The attack necessitated a patch to numerous SSL implementations. The 
RSA-OAEP encryption scheme was proposed by Bellare and Rogaway [38] and proved 
secure in the random oracle model by Shoup [427] and Fujisaki, Okamoto, Pointcheval 
and Stern [153]. It has been included in many standards including the v2.1 update of
PKCS#1 [395]. Manger [303] presented his attack on RSA-OAEP in 2001. Vaudenay
[466] described error message analysis attacks on symmetric-key encryption when
messages are first formatted by padding and then encrypted with a block cipher in CBC 
mode. 


Fault analysis attacks were first considered in 1997 by Boneh, DeMillo and Lipton 
[56, 57], who described such attacks on the RSA signature scheme and the Fiat-Shamir 
and Schnorr identification protocols. Bao et al. [28] presented fault analysis attacks on 
the ElGamal, Schnorr and DSA signature schemes. The FDH and PSS variants of the 
RSA signature scheme are due to Bellare and Rogaway [39], who proved their security 
(in the sense of Definition 4.28) under the assumptions that finding $e$th roots modulo $n$
is intractable and that the hash functions employed are random functions. Fault
analysis attacks on elliptic curve public-key encryption schemes were presented by Biehl,
Meyer and Müller [46]. Their attacks succeed if an error during the decryption process
produces a point that is not on the valid elliptic curve. The attacks can be prevented 
by ensuring that points that are the result of a cryptographic calculation indeed lie on 
the correct elliptic curve. Biham and Shamir [48] presented fault analysis attacks on 
the DES symmetric-key encryption scheme. Anderson and Kuhn [12] discuss some 
realistic ways of inducing transient faults, which they call glitches. More recently, Sko- 
robogatov and Anderson [437] demonstrated that inexpensive equipment can be used 
to induce faults in a smart card by illuminating specific transistors; they also propose 
countermeasures to these optical fault induction attacks. 


Timing attacks were introduced in 1996 by Kocher [264], who described attacks on 
RSA modular exponentiation. Schindler [407] presented timing attacks on implementations
of RSA exponentiation that employ the Chinese Remainder Theorem. Experimental
results for an RSA implementation on a smart card were reported by Dhem et al. [117]. 
Timing attacks on DES that recover the Hamming weight of the secret key were de- 
scribed by Hevia and Kiwi [197]. Brumley and Boneh [78] demonstrated that timing 
attacks can reveal RSA private keys from an OpenSSL-based web server over a lo- 
cal network. Canvel, Hiltgen, Vaudenay and Vuagnoux [86] devised timing attacks on 
the CBC-mode encryption schemes used in SSL and TLS; their attacks can decrypt 
commonly used ciphertext such as the encryption of a password. 
Sample Parameters


This appendix presents elliptic curve domain parameters $D = (q, \mathrm{FR}, S, a, b, P, n, h)$
that are suitable for cryptographic use; see §4.2 for a review of the notation. In §A.1,
an algorithm for testing irreducibility of a polynomial is presented. This algorithm can
be used to generate a reduction polynomial for representing elements of the finite field
$\mathbb{F}_{p^m}$. Also included in §A.1 are tables of irreducible binary polynomials that are
recommended by several standards including ANSI X9.62 and ANSI X9.63 as reduction
polynomials for representing the elements of binary fields $\mathbb{F}_{2^m}$. The 15 elliptic curves
recommended by NIST in the FIPS 186-2 standard for U.S. federal government use are
listed in §A.2.


A.1 Irreducible polynomials


A polynomial $f(z) = a_m z^m + \cdots + a_1 z + a_0 \in \mathbb{F}_p[z]$ of degree $m \ge 1$ is irreducible
over $\mathbb{F}_p$ if $f(z)$ cannot be factored as a product of polynomials in $\mathbb{F}_p[z]$ each of degree
less than $m$. Since $f(z)$ is irreducible if and only if $a_m^{-1} f(z)$ is irreducible, it suffices to
consider only monic polynomials (i.e., polynomials with leading coefficient $a_m = 1$).
For any prime $p$ and integer $m \ge 1$, there exists at least one monic irreducible
polynomial of degree $m$ in $\mathbb{F}_p[z]$. In fact, the exact number of such polynomials is

$$N_p(m) = \frac{1}{m} \sum_{d \mid m} \mu(d)\, p^{m/d},$$

where the summation index $d$ ranges over all positive divisors of $m$, and the Möbius
function $\mu$ is defined as follows:

$$\mu(d) = \begin{cases} 1, & \text{if } d = 1, \\ 0, & \text{if } d \text{ is divisible by the square of a prime}, \\ (-1)^l, & \text{if } d \text{ is the product of } l \text{ distinct primes}. \end{cases}$$

It has been shown that

$$\frac{1}{2m} \le \frac{N_p(m)}{p^m} \approx \frac{1}{m}.$$





Thus, if polynomials in $\mathbb{F}_p[z]$ can be efficiently tested for irreducibility, then irreducible
polynomials of degree $m$ can be efficiently found by selecting random monic polynomials
of degree $m$ in $\mathbb{F}_p[z]$ until an irreducible one is found; the expected number of
trials is approximately $m$.
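
The counting formula can be checked numerically. The following Python sketch is illustrative only: the function names `mobius` and `count_monic_irreducible` are hypothetical and are not taken from the text. It evaluates $N_p(m)$ by summing over the divisors of $m$ and prints the product $m \cdot N_p(m)/p^m$, which should be close to 1 if roughly one in $m$ monic polynomials is irreducible.

```python
from fractions import Fraction


def mobius(d):
    """Moebius function: 0 if d has a squared prime factor, else (-1)^(number of primes)."""
    result = 1
    q = 2
    while q * q <= d:
        if d % q == 0:
            d //= q
            if d % q == 0:      # q^2 divides the original d
                return 0
            result = -result
        q += 1
    if d > 1:                   # one remaining prime factor
        result = -result
    return result


def count_monic_irreducible(p, m):
    """Number of monic irreducible polynomials of degree m over F_p."""
    total = sum(mobius(d) * p ** (m // d) for d in range(1, m + 1) if m % d == 0)
    return total // m           # the sum is always divisible by m


if __name__ == "__main__":
    for m in (2, 3, 4, 8, 163):
        count = count_monic_irreducible(2, m)
        ratio = float(Fraction(count, 2 ** m) * m)
        print(m, count, ratio)
```

For example, the formula gives 30 irreducible polynomials of degree 8 over $\mathbb{F}_2$, since $(2^8 - 2^4)/8 = 30$.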

Algorithm A.1 is an efficient test for deciding irreducibility. It is based on the fact
that a polynomial $f(z)$ of degree $m$ is irreducible over $\mathbb{F}_p$ if and only if
$\gcd(f(z), z^{p^i} - z) = 1$ for each $i$, $1 \le i \le \lfloor m/2 \rfloor$.


Algorithm A.1 Testing a polynomial for irreducibility


INPUT: A prime $p$ and a polynomial $f(z) \in \mathbb{F}_p[z]$ of degree $m \ge 1$.
OUTPUT: Irreducibility of $f(z)$.
1. $u(z) \leftarrow z$.
2. For $i$ from 1 to $\lfloor m/2 \rfloor$ do:
   2.1 $u(z) \leftarrow u(z)^p \bmod f(z)$.
   2.2 $d(z) \leftarrow \gcd(f(z), u(z) - z)$.
   2.3 If $d(z) \ne 1$ then return(“reducible”).
3. Return(“irreducible”).
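
The steps of Algorithm A.1 can be sketched in Python for the special case $p = 2$. This is a minimal illustration under stated assumptions, not a reference implementation: a binary polynomial is encoded as a Python integer whose bit $i$ is the coefficient of $z^i$, and the helper names `poly_mod`, `poly_mulmod`, `poly_gcd` and `is_irreducible` are hypothetical. Over $\mathbb{F}_2$, subtraction is XOR and step 2.1 is a modular squaring.

```python
def poly_mod(a, f):
    """Remainder of binary polynomial a modulo nonzero f (ints, bit i = coeff of z^i)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def poly_mulmod(a, b, f):
    """Product a*b reduced modulo f, by shift-and-add over F_2."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a = poly_mod(a << 1, f)
    return poly_mod(result, f)


def poly_gcd(a, b):
    """Euclidean algorithm; over F_2 the result is automatically monic."""
    while b:
        a, b = b, poly_mod(a, b)
    return a


def is_irreducible(f):
    """Algorithm A.1 with p = 2: True if f is irreducible over F_2."""
    m = f.bit_length() - 1
    if m < 1:
        raise ValueError("degree must be at least 1")
    u = 0b10                                  # step 1: u(z) <- z
    for _ in range(m // 2):                   # step 2: i = 1 .. floor(m/2)
        u = poly_mulmod(u, u, f)              # 2.1: u <- u^2 mod f
        if poly_gcd(f, u ^ 0b10) != 1:        # 2.2, 2.3: gcd(f, u - z) != 1
            return False
    return True                               # step 3


if __name__ == "__main__":
    f163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
    print(is_irreducible(f163))     # reduction polynomial used for B-163 and K-163
    print(is_irreducible(0b101))    # z^2 + 1 = (z + 1)^2
```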


For each $m$, $2 \le m \le 600$, Tables A.1 and A.2 list an irreducible trinomial or
pentanomial $f(z)$ of degree $m$ over $\mathbb{F}_2$. The entries in the column labeled “T” are the
degrees of the nonzero terms of the polynomial excluding the leading term $z^m$ and
the constant term 1. For example, $T = k$ represents the trinomial $z^m + z^k + 1$, and
$T = (k_3, k_2, k_1)$ represents the pentanomial $z^m + z^{k_3} + z^{k_2} + z^{k_1} + 1$. The following
criteria from the ANSI X9.62 and ANSI X9.63 standards were used to select the reduction
polynomials:

(i) If there exists an irreducible trinomial of degree $m$ over $\mathbb{F}_2$, then $f(z)$ is the
irreducible trinomial $z^m + z^k + 1$ for the smallest possible $k$.

(ii) If there does not exist an irreducible trinomial of degree $m$ over $\mathbb{F}_2$, then $f(z)$
is the irreducible pentanomial $z^m + z^{k_3} + z^{k_2} + z^{k_1} + 1$ for which (a) $k_3$ is the
smallest possible; (b) for this particular value of $k_3$, $k_2$ is the smallest possible;
and (c) for these particular values of $k_3$ and $k_2$, $k_1$ is the smallest possible.
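
The two selection criteria amount to a lexicographic search. The sketch below is one possible way to carry it out in Python, assuming the integer encoding of binary polynomials (bit $i$ is the coefficient of $z^i$) and an irreducibility test along the lines of Algorithm A.1; the function names `is_irreducible` and `reduction_polynomial` are hypothetical, and the code is written for clarity rather than speed.

```python
def _mod(a, f):
    """Remainder of binary polynomial a modulo nonzero f."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def _square_mod(u, f):
    """u^2 mod f over F_2, by shift-and-add multiplication."""
    a, b, result = u, u, 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a = _mod(a << 1, f)
    return _mod(result, f)


def _gcd(a, b):
    while b:
        a, b = b, _mod(a, b)
    return a


def is_irreducible(f):
    """Irreducibility test over F_2 in the style of Algorithm A.1."""
    m = f.bit_length() - 1
    u = 0b10
    for _ in range(m // 2):
        u = _square_mod(u, f)
        if _gcd(f, u ^ 0b10) != 1:
            return False
    return True


def reduction_polynomial(m):
    """Return T = (k,) or T = (k3, k2, k1) chosen by criteria (i) and (ii)."""
    base = (1 << m) | 1                       # z^m + 1
    for k in range(1, m):                     # (i): smallest k for a trinomial
        if is_irreducible(base | (1 << k)):
            return (k,)
    for k3 in range(3, m):                    # (ii): smallest k3, then k2, then k1
        for k2 in range(2, k3):
            for k1 in range(1, k2):
                if is_irreducible(base | (1 << k3) | (1 << k2) | (1 << k1)):
                    return (k3, k2, k1)
    return None


if __name__ == "__main__":
    for m in (5, 8, 163):
        print(m, reduction_polynomial(m))
```

For $m = 8$ no irreducible trinomial exists, so the search falls through to the pentanomial case and returns $(4, 3, 1)$, that is, $z^8 + z^4 + z^3 + z + 1$.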
Table A.1. Irreducible binary polynomials of degree $m$, $2 \le m \le 300$.

Table A.2. Irreducible binary polynomials of degree $m$, $301 \le m \le 600$.
A.2 Elliptic curves


In the FIPS 186-2 standard, NIST recommended 15 elliptic curves of varying security 
levels for U.S. federal government use. The curves are of three types: 


(i) random elliptic curves over a prime field $\mathbb{F}_p$;
(ii) random elliptic curves over a binary field $\mathbb{F}_{2^m}$; and
(iii) Koblitz elliptic curves over a binary field $\mathbb{F}_{2^m}$.


Their parameters are listed in §A.2.1, §A.2.2 and §A.2.3, respectively. 

In the tables that follow, integers and polynomials are sometimes represented as
hexadecimal strings. For example, “0x1BB5” is the hexadecimal representation of the
integer 7093. The coefficients of the binary polynomial $z^{13} + z^{11} + z^5 + z^2 + z + 1$ form
a binary string “10100000100111” which has hexadecimal representation “0x2827”.
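
The two conversions in this paragraph can be reproduced with a few lines of Python. This is only an illustration of the encoding; the helper name `poly_from_exponents` is hypothetical.

```python
def poly_from_exponents(exponents):
    """Encode a binary polynomial as an int: bit e is set for each term z^e."""
    bits = 0
    for e in exponents:
        bits |= 1 << e
    return bits


value = 0x1BB5
poly = poly_from_exponents([13, 11, 5, 2, 1, 0])

print(value)               # 7093
print(format(poly, "b"))   # 10100000100111
print(hex(poly))           # 0x2827
```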





A.2.1 Random elliptic curves over $\mathbb{F}_p$


Table A.3 lists domain parameters for the five NIST-recommended randomly chosen
elliptic curves over prime fields $\mathbb{F}_p$. The primes $p$ were specially chosen to allow for
very fast reduction of integers modulo $p$ (see §2.2.6). The selection $a = -3$ for the
coefficient in the elliptic curve equation was made so that elliptic curve points represented
in Jacobian projective coordinates could be added using one fewer field multiplication
(see §3.2.2). The following parameters are given for each curve:


$p$      The order of the prime field $\mathbb{F}_p$.

$S$      The seed selected to randomly generate the coefficients of the elliptic
         curve using Algorithm 4.17.

$r$      The output of SHA-1 in Algorithm 4.17.

$a, b$   The coefficients of the elliptic curve $y^2 = x^3 + ax + b$ satisfying
         $rb^2 \equiv a^3 \pmod{p}$.

$n$      The (prime) order of the base point $P$.

$h$      The cofactor.

$x, y$   The $x$ and $y$ coordinates of $P$.
P-192: $p = 2^{192} - 2^{64} - 1$, $a = -3$, $h = 1$
S = 0x 3045AE6F C8422F64 ED579528 D38120EA E12196D5
r = 0x 3099D2BB BFCB2538 542DCD5F B078B6EF 5F3D6FE2 C745DE65
b = 0x 64210519 E59C80E7 0FA7E9AB 72243049 FEB8DEEC C146B9B1
n = 0x FFFFFFFF FFFFFFFF FFFFFFFF 99DEF836 146BC9B1 B4D22831
x = 0x 188DA80E B03090F6 7CBF20EB 43A18800 F4FF0AFD 82FF1012
y = 0x 07192B95 FFC8DA78 631011ED 6B24CDD5 73F977A1 1E794811

P-224: $p = 2^{224} - 2^{96} + 1$, $a = -3$, $h = 1$
S = 0x BD713447 99D5C7FC DC45B59F A3B9AB8F 6A948BC5
r = 0x 5B056C7E 11DD68F4 0469EE7F 3C7A7D74 F7D12111 6506D031 218291FB
b = 0x B4050A85 0C04B3AB F5413256 5044B0B7 D7BFD8BA 270B3943 2355FFB4
n = 0x FFFFFFFF FFFFFFFF FFFFFFFF FFFF16A2 E0B8F03E 13DD2945 5C5C2A3D
x = 0x B70E0CBD 6BB4BF7F 321390B9 4A03C1D3 56C21122 343280D6 115C1D21
y = 0x BD376388 B5F723FB 4C22DFE6 CD4375A0 5A074764 44D58199 85007E34

P-256: $p = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$, $a = -3$, $h = 1$
S = 0x C49D3608 86E70493 6A6678E1 139D26B7 819F7E90
r = 0x 7EFBA166 2985BE94 03CB055C 75D4F7E0 CE8D84A9 C5114ABC AF317768 0104FA0D
b = 0x 5AC635D8 AA3A93E7 B3EBBD55 769886BC 651D06B0 CC53B0F6 3BCE3C3E 27D2604B
n = 0x FFFFFFFF 00000000 FFFFFFFF FFFFFFFF BCE6FAAD A7179E84 F3B9CAC2 FC632551
x = 0x 6B17D1F2 E12C4247 F8BCE6E5 63A440F2 77037D81 2DEB33A0 F4A13945 D898C296
y = 0x 4FE342E2 FE1A7F9B 8EE7EB4A 7C0F9E16 2BCE3357 6B315ECE CBB64068 37BF51F5

P-384: $p = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$, $a = -3$, $h = 1$
S = 0x A335926A A319A27A 1D00896A 6773A482 7ACDAC73
r = 0x 79D1E655 F868F02F FF48DCDE E14151DD B80643C1 406D0CA1 0DFE6FC5 2009540A
       495E8042 EA5F744F 6E184667 CC722483
b = 0x B3312FA7 E23EE7E4 988E056B E3F82D19 181D9C6E FE814112 0314088F 5013875A
       C656398D 8A2ED19D 2A85C8ED D3EC2AEF
n = 0x FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF C7634D81 F4372DDF
       581A0DB2 48B0A77A ECEC196A CCC52973
x = 0x AA87CA22 BE8B0537 8EB1C71E F320AD74 6E1D3B62 8BA79B98 59F741E0 82542A38
       5502F25D BF55296C 3A545E38 72760AB7
y = 0x 3617DE4A 96262C6F 5D9E98BF 9292DC29 F8F41DBD 289A147C E9DA3113 B5F0B8C0
       0A60B1CE 1D7E819D 7A431D7C 90EA0E5F

P-521: $p = 2^{521} - 1$, $a = -3$, $h = 1$
S = 0x D09E8800 291CB853 96CC6717 393284AA A0DA64BA
r = 0x 000000B4 8BFA5F42 0A349495 39D2BDFC 264EEEEB 077688E4 4FBF0AD8 F6D0EDB3
       7BD6B533 28100051 8E19F1B9 FFBE0FE9 ED8A3C22 00B8F875 E523868C 70C1E5BF
       55BAD637
b = 0x 00000051 953EB961 8E1C9A1F 929A21A0 B68540EE A2DA725B 99B315F3 B8B48991
       8EF109E1 56193951 EC7E937B 1652C0BD 3BB1BF07 3573DF88 3D2C34F1 EF451FD4
       6B503F00
n = 0x 000001FF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
       FFFFFFFA 51868783 BF2F966B 7FCC0148 F709A5D0 3BB5C9B8 899C47AE BB6FB71E
       91386409
x = 0x 000000C6 858E06B7 0404E9CD 9E3ECB66 2395B442 9C648139 053FB521 F828AF60
       6B4D3DBA A14B5E77 EFE75928 FE1DC127 A2FFA8DE 3348B3C1 856A429B F97E7E31
       C2E5BD66
y = 0x 00000118 39296A78 9A3BC004 5C8A5FB4 2C7D1BD9 98F54449 579B4468 17AFBD17
       273E662C 97EE7299 5EF42640 C550B901 3FAD0761 353C7086 A272C240 88BE9476
       9FD16650

Table A.3. NIST-recommended random elliptic curves over prime fields. 
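
Two of the relations stated for the prime-field parameters can be checked directly: the base point should satisfy the curve equation, and $r$ and $b$ should satisfy $rb^2 \equiv a^3 \pmod{p}$. The Python sketch below does this for P-192. It is a sanity check under stated assumptions rather than a full domain parameter validation (it does not verify the order $n$ or the seed $S$); the hexadecimal constants are transcribed from the P-192 entry, and the variable and function names are hypothetical.

```python
# P-192 domain parameters (transcribed from the P-192 entry of Table A.3).
p = 2**192 - 2**64 - 1
a = -3 % p
r = 0x3099D2BB_BFCB2538_542DCD5F_B078B6EF_5F3D6FE2_C745DE65
b = 0x64210519_E59C80E7_0FA7E9AB_72243049_FEB8DEEC_C146B9B1
x = 0x188DA80E_B03090F6_7CBF20EB_43A18800_F4FF0AFD_82FF1012
y = 0x07192B95_FFC8DA78_631011ED_6B24CDD5_73F977A1_1E794811


def on_curve(px, py):
    """True if (px, py) satisfies y^2 = x^3 + ax + b over F_p."""
    return (py * py - (px * px * px + a * px + b)) % p == 0


def seed_relation_holds():
    """True if r * b^2 = a^3 (mod p), the relation tying b to the SHA-1 output r."""
    return (r * b * b - pow(a, 3, p)) % p == 0


if __name__ == "__main__":
    print("base point on curve:", on_curve(x, y))
    print("r*b^2 = a^3 mod p:  ", seed_relation_holds())
```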
A.2.2 Random elliptic curves over $\mathbb{F}_{2^m}$


Table A.4 lists domain parameters for the five NIST-recommended randomly chosen
elliptic curves over binary fields $\mathbb{F}_{2^m}$. The extension degrees $m$ are prime and were
selected so that there exists a Koblitz curve over $\mathbb{F}_{2^m}$ having almost-prime group
order (see §A.2.3). Algorithm 4.19 was used to generate the coefficient $b$ of an elliptic
curve over $\mathbb{F}_{2^m}$ from the seed $S$. The output $b$ of the algorithm was interpreted as
an element of $\mathbb{F}_{2^m}$ represented with respect to the Gaussian normal basis specified in
FIPS 186-2. A change-of-basis matrix was then used to transform $b$ to a polynomial
basis representation; see FIPS 186-2 for more details. The following parameters are
given for each curve:


$m$      The extension degree of the binary field $\mathbb{F}_{2^m}$.

$f(z)$   The reduction polynomial of degree $m$.

$S$      The seed selected to randomly generate the coefficients of the elliptic
         curve.

$a, b$   The coefficients of the elliptic curve $y^2 + xy = x^3 + ax^2 + b$.

$n$      The (prime) order of the base point $P$.

$h$      The cofactor.

$x, y$   The $x$ and $y$ coordinates of $P$.


A.2.3 Koblitz elliptic curves over $\mathbb{F}_{2^m}$


Table A.5 lists domain parameters for the five NIST-recommended Koblitz curves over 
binary fields. The binary fields $\mathbb{F}_{2^m}$ are the same as for the random curves in §A.2.2.
Koblitz curves were selected because point multiplication can be performed faster than 
for the random curves (see §3.4). The following parameters are given for each curve: 


$m$      The extension degree of the binary field $\mathbb{F}_{2^m}$.

$f(z)$   The reduction polynomial of degree $m$.

$a, b$   The coefficients of the elliptic curve $y^2 + xy = x^3 + ax^2 + b$.

$n$      The (prime) order of the base point $P$.

$h$      The cofactor.

$x, y$   The $x$ and $y$ coordinates of $P$.
B-163: $m = 163$, $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$, $a = 1$, $h = 2$
S = 0x 85E25BFE 5C86226C DB12016F 7553F9D0 E693A268
b = 0x 00000002 0A601907 B8C953CA 1481EB10 512F7874 4A3205FD
n = 0x 00000004 00000000 00000000 000292FE 77E70C12 A4234C33
x = 0x 00000003 F0EBA162 86A2D57E A0991168 D4994637 E8343E36
y = 0x 00000000 D51FBC6C 71A0094F A2CDD545 B11C5C0C 797324F1

B-233: $m = 233$, $f(z) = z^{233} + z^{74} + 1$, $a = 1$, $h = 2$
S = 0x 74D59FF0 7F6B413D 0EA14B34 4B20A2DB 049B50C3
b = 0x 00000066 647EDE6C 332C7F8C 0923BB58 213B333B 20E9CE42 81FE115F 7D8F90AD
n = 0x 00000100 00000000 00000000 00000000 0013E974 E72F8A69 22031D26 03CFE0D7
x = 0x 000000FA C9DFCBAC 8313BB21 39F1BB75 5FEF65BC 391F8B36 F8F8EB73 71FD558B
y = 0x 00000100 6A08A419 03350678 E58528BE BF8A0BEF F867A7CA 36716F7E 01F81052

B-283: $m = 283$, $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1$, $a = 1$, $h = 2$
S = 0x 77E2B073 70EB0F83 2A6DD5B6 2DFC88CD 06BB84BE
b = 0x 027B680A C8B8596D A5A4AF8A 19A0303F CA97FD76 45309FA2 A581485A F6263E31
       3B79A2F5
n = 0x 03FFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFEF90 399660FC 938A9016 5B042A7C
       EFADB307
x = 0x 05F93925 8DB7DD90 E1934F8C 70B0DFEC 2EED25B8 557EAC9C 80E2E198 F8CDBECD
       86B12053
y = 0x 03676854 FE24141C B98FE6D4 B20D02B4 516FF702 350EDDB0 826779C8 13F0DF45
       BE8112F4

B-409: $m = 409$, $f(z) = z^{409} + z^{87} + 1$, $a = 1$, $h = 2$
S = 0x 4099B5A4 57F9D69F 79213D09 4C4BCD4D 4262210B
b = 0x 0021A5C2 C8EE9FEB 5C4B9A75 3B7B476B 7FD6422E F1F3DD67 4761FA99 D6AC27C8
       A9A197B2 72822F6C D57A55AA 4F50AE31 7B13545F
n = 0x 01000000 00000000 00000000 00000000 00000000 00000000 000001E2 AAD6A612
       F33307BE 5FA47C3C 9E052F83 8164CD37 D9A21173
x = 0x 015D4860 D088DDB3 496B0C60 64756260 441CDE4A F1771D4D B01FFE5B 34E59703
       DC255A86 8A118051 5603AEAB 60794E54 BB7996A7
y = 0x 0061B1CF AB6BE5F3 2BBFA783 24ED106A 7636B9C5 A7BD198D 0158AA4F 5488D08F
       38514F1F DF4B4F40 D2181B36 81C364BA 0273C706

B-571: $m = 571$, $f(z) = z^{571} + z^{10} + z^5 + z^2 + 1$, $a = 1$, $h = 2$
S = 0x 2AA058F7 3A0E33AB 486B0F61 0410C53A 7F132310
b = 0x 02F40E7E 2221F295 DE297117 B7F3D62F 5C6A97FF CB8CEFF1 CD6BA8CE 4A9A18AD
       84FFABBD 8EFA5933 2BE7AD67 56A66E29 4AFD185A 78FF12AA 520E4DE7 39BACA0C
       7FFEFF7F 2955727A
n = 0x 03FFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
       FFFFFFFF E661CE18 FF559873 08059B18 6823851E C7DD9CA1 161DE93D 5174D66E
       8382E9BB 2FE84E47
x = 0x 0303001D 34B85629 6C16C0D4 0D3CD775 0A93D1D2 955FA80A A5F40FC8 DB7B2ABD
       BDE53950 F4C0D293 CDD711A3 5B67FB14 99AE6003 8614F139 4ABFA3B4 C850D927
       E1E7769C 8EEC2D19
y = 0x 037BF273 42DA639B 6DCCFFFE B73D69D7 8C6C27A6 009CBBCA 1980F853 3921E8A6
       84423E43 BAB08A57 6291AF8F 461BB2A8 B3531D2F 0485C19B 16E2F151 6E23DD3C
       1A4827AF 1B8AC15B

Table A.4. NIST-recommended random elliptic curves over binary fields. 
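
As with the prime-field curves, the binary-field parameters can be sanity-checked by confirming that a base point satisfies $y^2 + xy = x^3 + ax^2 + b$ in $\mathbb{F}_{2^m}$. The following Python sketch does this for the two curves over $\mathbb{F}_{2^{163}}$, B-163 and the Koblitz curve K-163 of Table A.5, using a polynomial basis with the reduction polynomial $z^{163} + z^7 + z^6 + z^3 + 1$. Field elements are encoded as integers (bit $i$ is the coefficient of $z^i$); the function names `gf_mul` and `on_curve` are hypothetical, and the multiplication is a simple bit-serial method chosen for clarity, not speed.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # reduction polynomial f(z)


def gf_mul(a, b):
    """Multiply two elements of F_{2^163} (polynomial basis, bit-serial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached 163: reduce by f(z)
            a ^= F
    return result


def on_curve(x, y, a, b):
    """True if (x, y) satisfies y^2 + xy = x^3 + a*x^2 + b over F_{2^163}."""
    x2 = gf_mul(x, x)
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(x2, x) ^ gf_mul(a, x2) ^ b
    return lhs == rhs


# B-163 (a = 1); constants transcribed from Table A.4.
B163_B = 0x00000002_0A601907_B8C953CA_1481EB10_512F7874_4A3205FD
B163_X = 0x00000003_F0EBA162_86A2D57E_A0991168_D4994637_E8343E36
B163_Y = 0x00000000_D51FBC6C_71A0094F_A2CDD545_B11C5C0C_797324F1

# K-163 (a = 1, b = 1); constants transcribed from Table A.5.
K163_X = 0x00000002_FE13C053_7BBC11AC_AA07D793_DE4E6D5E_5C94EEE8
K163_Y = 0x00000002_89070FB0_5D38FF58_321F2E80_0536D538_CCDAA3D9

if __name__ == "__main__":
    print("B-163 base point on curve:", on_curve(B163_X, B163_Y, 1, B163_B))
    print("K-163 base point on curve:", on_curve(K163_X, K163_Y, 1, 1))
```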


K-163: $m = 163$, $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$, $a = 1$, $b = 1$, $h = 2$
n = 0x 00000004 00000000 00000000 00020108 A2E0CC0D 99F8A5EF
x = 0x 00000002 FE13C053 7BBC11AC AA07D793 DE4E6D5E 5C94EEE8
y = 0x 00000002 89070FB0 5D38FF58 321F2E80 0536D538 CCDAA3D9

K-233: $m = 233$, $f(z) = z^{233} + z^{74} + 1$, $a = 0$, $b = 1$, $h = 4$
n = 0x 00000080 00000000 00000000 00000000 00069D5B B915BCD4 6EFB1AD5 F173ABDF
x = 0x 00000172 32BA853A 7E731AF1 29F22FF4 149563A4 19C26BF5 0A4C9D6E EFAD6126
y = 0x 000001DB 537DECE8 19B7F70F 555A67C4 27A8CD9B F18AEB9B 56E0C110 56FAE6A3

K-283: $m = 283$, $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1$, $a = 0$, $b = 1$, $h = 4$
n = 0x 01FFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFE9AE 2ED07577 265DFF7F 94451E06
       1E163C61
x = 0x 0503213F 78CA4488 3F1A3B81 62F188E5 53CD265F 23C1567A 16876913 B0C2AC24
       58492836
y = 0x 01CCDA38 0F1C9E31 8D90F95D 07E5426F E87E45C0 E8184698 E4596236 4E341161
       77DD2259

K-409: $m = 409$, $f(z) = z^{409} + z^{87} + 1$, $a = 0$, $b = 1$, $h = 4$
n = 0x 007FFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFE5F 83B2D4EA
       20400EC4 557D5ED3 E3E7CA5B 4B5C83B8 E01E5FCF
x = 0x 0060F05F 658F49C1 AD3AB189 0F718421 0EFD0987 E307C84C 27ACCFB8 F9F67CC2
       C460189E B5AAAA62 EE222EB1 B35540CF E9023746
y = 0x 01E36905 0B7C4E42 ACBA1DAC BF04299C 3460782F 918EA427 E6325165 E9EA10E3
       DA5F6C42 E9C55215 AA9CA27A 5863EC48 D8E0286B

K-571: $m = 571$, $f(z) = z^{571} + z^{10} + z^5 + z^2 + 1$, $a = 0$, $b = 1$, $h = 4$
n = 0x 02000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
       00000000 131850E1 F19A63E4 B391A8DB 917F4138 B630D84B E5D63938 1E91DEB4
       5CFE778F 637C1001
x = 0x 026EB7A8 59923FBC 82189631 F8103FE4 AC9CA297 0012D5D4 60248048 01841CA4
       43709584 93B205E6 47DA304D B4CEB08C BBD1BA39 494776FB 988B4717 4DCA88C7
       E2945283 A01C8972
y = 0x 0349DC80 7F4FBF37 4F4AEADE 3BCA9531 4DD58CEC 9F307A54 FFC61EFC 006D8A2C
       9D4979C0 AC44AEA7 4FBEBBB9 F772AEDC B620B01A 7BA7AF1B 320430C8 591984F6
       01CD4C14 3EF1C7A3

Table A.5. NIST-recommended Koblitz curves over binary fields.






APPENDIX B 








ECC Standards 


Cryptographic standards are important for two reasons: (i) to facilitate the widespread
use of cryptographically sound and well-specified techniques; and (ii) to promote in- 
teroperability between different implementations. Interoperability is encouraged by 
completely specifying the steps of the cryptographic schemes and the formats for 
shared data such as domain parameters, keys and exchanged messages, and by limiting 
the number of options available to the implementor. 

This section describes the salient features of selected standards and draft standards 
that describe elliptic curve mechanisms for signatures, encryption, and key establish- 
ment. A summary is provided in Table B.1. Electronic copies of the standards can be 
obtained online from the web sites listed in Table B.2. It should be noted that many of 
these standards are updated periodically. Readers should consult the web sites for the 
latest drafts. 


American National Standards Institute (ANSI) The ANSI X9F subcommittee of 
the ANSI X9 committee develops information security standards for the financial ser- 
vices industry. Two elliptic curve standards have been completed: ANSI X9.62 which 
specifies the ECDSA (§4.4.1), and ANSI X9.63 which specifies numerous elliptic 
curve key agreement and key transport protocols including STS (§4.6.1), ECMQV 
(§4.6.2), and ECIES (§4.5.1). The objective of these standards is to achieve a high 
degree of security and interoperability. The underlying finite field is restricted to being
a prime field $\mathbb{F}_p$ or a binary field $\mathbb{F}_{2^m}$. The elements of $\mathbb{F}_{2^m}$ may be represented using
a polynomial basis or a normal basis over $\mathbb{F}_2$. If a polynomial basis is desired, then the
reduction polynomial must be an irreducible trinomial, if one exists, and an irreducible
pentanomial otherwise. To facilitate interoperability, a specific reduction polynomial is
recommended for each field $\mathbb{F}_{2^m}$; these polynomials of degree $m$, where $2 \le m \le 600$,
are listed in Tables A.1 and A.2. If a normal basis is desired, a specific Gaussian normal
ANSI X9.62 | 1999 | The elliptic curve digital signature algorithm | [14]
IEEE 1363-2000 | 2000 | Standard specifications for public-key cryptography | [204]
ISO/IEC 15946-1 | 2002 | Techniques based on elliptic curves, Part 1: General | [211]
ISO/IEC 15946-2 | 2002 | Part 2: Digital signatures |
ISO/IEC 15946-3 | 2002 | Part 3: Key establishment |
ISO/IEC 15946-4 | (draft) | Part 4: Digital signatures giving message recovery |
ISO/IEC 18033-2 | (draft) | Encryption algorithms, Part 2: Asymmetric ciphers |
SEC 1 | 2000 | Elliptic curve cryptography | [417]
SEC 2 | 2000 | Recommended elliptic curve domain parameters | [418]


Table B.1. Selected standards and draft standards that specify cryptographic mechanisms based
on elliptic curves.


ANSI | American National Standards Institute | http://www.ansi.org
X9 | Standards for the Financial Services Industry | http://www.x9.org
IEEE | Institute of Electrical and Electronics Engineers | http://www.ieee.org
P1363 | Specifications for Public-Key Cryptography | http://grouper.ieee.org/groups/1363
ISO | International Organization for Standardization | http://www.iso.ch
IEC | International Electrotechnical Commission | http://www.iec.ch
SC 27 | Information Technology - Security Techniques | http://www.din.de/ni/sc27
NIST | National Institute of Standards and Technology | http://www.nist.gov
FIPS | Federal Information Processing Standards | http://www.itl.nist.gov/fipspubs
SECG | Standards for Efficient Cryptography Group | http://www.secg.org
SEC | Standards for Efficient Cryptography documents | http://www.secg.org/secg_docs.htm
NESSIE | New European Schemes for Signatures, Integrity and Encryption | http://www.cryptonessie.org
IPA | Information-technology Promotion Agency | http://www.ipa.go.jp/ipa-e/index-e.html
CRYPTREC | Cryptographic Research and Evaluation Committee | http://www.ipa.go.jp/security/enc/CRYPTREC/index-e.html


Table B.2. URLs for standards bodies and working groups.
basis is mandated. The primary security requirement is that the order $n$ of the base point
$P$ should be greater than $2^{160}$. The only hash function employed is SHA-1; however, it
is anticipated that ANSI X9.62 and X9.63 will be updated in the coming years to allow 
for hash functions of varying output lengths. 


National Institute of Standards and Technology (NIST) NIST is a non-regulatory 
federal agency within the U.S. Commerce Department’s Technology Administration. 
Included in its mission is the development of security-related Federal Information Pro- 
cessing Standards (FIPS) intended for use by U.S. federal government departments. 
The FIPS standards widely adopted and deployed around the world include the Data
Encryption Standard (DES: FIPS 46), the Secure Hash Algorithms (SHA-1, SHA-256, 
SHA-384 and SHA-512: FIPS 180-2 [138]), the Advanced Encryption Standard (AES: 
FIPS 197 [141]), and Hash-based Message Authentication Code (HMAC: FIPS 198 
[142]). FIPS 186-2, also known as the Digital Signature Standard (DSS), specifies the 
RSA, DSA and ECDSA signature schemes. ECDSA is specified simply by reference 
to ANSI X9.62 with a recommendation to use the 15 elliptic curves listed in §A.2.1, 
§A.2.2 and §A.2.3. NIST is in the process of developing a recommendation [342] for 
elliptic curve key establishment schemes that will include a selection of protocols from 
ANSI X9.63. 


Institute of Electrical and Electronics Engineers (IEEE) The IEEE P1363 work- 
ing group is developing a suite of standards for public-key cryptography. The scope 
of P1363 is very broad and includes schemes based on the intractability of inte- 
ger factorization, discrete logarithm in finite fields, elliptic curve discrete logarithms, 
and lattice-based schemes. The 1363-2000 standard includes elliptic curve signature 
schemes (ECDSA and an elliptic curve analogue of a signature scheme due to Ny- 
berg and Rueppel), and elliptic curve key agreement schemes (ECMQV and variants 
of elliptic curve Diffie-Hellman (ECDH)). It differs fundamentally from the ANSI stan- 
dards and FIPS 186-2 in that there are no mandated minimum security requirements 
and there is an abundance of options. Its primary purpose, therefore, is to serve as a 
reference for specifications of a variety of cryptographic protocols from which other 
standards and applications can select. The 1363-2000 standard restricts the underlying
finite field to be a prime field $\mathbb{F}_p$ or a binary field $\mathbb{F}_{2^m}$. The P1363a draft standard is an
addendum to 1363-2000. It contains specifications of ECIES and the Pintsov-Vanstone
signature scheme providing message recovery, and allows for extension fields $\mathbb{F}_{p^m}$ of
odd characteristic including optimal extension fields (see §2.4).


International Organization for Standardization (ISO) ISO and the International 
Electrotechnical Commission (IEC) jointly develop cryptographic standards within 
the SC 27 subcommittee. ISO/IEC 15946 is a suite of elliptic curve cryptographic 
standards that specifies signature schemes (including ECDSA and EC-KCDSA), key 
establishment schemes (including ECMQV and STS), and digital signature schemes 
providing message recovery. ISO/IEC 18033-2 provides detailed descriptions and
security analyses of various public-key encryption schemes including ECIES-KEM and
PSEC-KEM.


Standards for Efficient Cryptography Group (SECG) SECG is a consortium of 
companies formed to address potential interoperability problems with cryptographic 
standards. SEC 1 specifies ECDSA, ECIES, ECDH and ECMQV, and attempts to be
compatible with all ANSI, NIST, IEEE and ISO/IEC elliptic curve standards. Some 
specific elliptic curves, including the 15 NIST elliptic curves, are listed in SEC 2. 


New European Schemes for Signatures, Integrity and Encryption (NESSIE) The 
NESSIE project was funded by the European Union’s Fifth Framework Programme. 
Its main objective was to assess and select various symmetric-key primitives (block 
ciphers, stream ciphers, hash functions, message authentication codes) and public-key 
primitives (public-key encryption, signature and identification schemes). The elliptic 
curve schemes selected were ECDSA and the key transport protocols PSEC-KEM and 
ACE-KEM. 


Cryptographic Research and Evaluation Committee (CRYPTREC) The
Information-technology Promotion Agency (IPA) in Japan formed the CRYPTREC committee
for the purpose of evaluating cryptographic protocols for securing the Japanese gov- 
ernment’s electronic business. Numerous symmetric-key and public-key primitives are 
being evaluated, including ECDSA, ECIES, PSEC-KEM and ECDH. 


APPENDIX C 





Software Tools 


This appendix lists software tools of interest to practitioners and educators. The listing 
is separated into two sections. §C.1 includes research and other tools, most of which 
are fairly general-purpose and do not necessarily require programming. §C.2 entries 
are more specialized or contain libraries to be used with programming languages such 
as C. Generally speaking, §C.1 is of interest to those involved in education and with 
prototyping, while developers may be primarily interested in §C.2. Researchers have 
used packages from both sections. The descriptions provided are, for the most part, 
adapted directly from those given by the package authors. 


C.1 General-purpose tools 


The entries in this section vary in capability and interface, with bc and calc as fairly
basic tools, and Maple, Mathematica, and MuPAD offering sophisticated graphics and 
advanced user interfaces. Magma is significantly more specialized than tools such as 
Mathematica, and has excellent support for elliptic curve operations such as point 
counting. GAP and KANT/KASH can be regarded as the most specialized of the 
packages in this section. 


bc http://www.gnu.org


bc is a language that supports arbitrary precision numbers with interactive
execution. There are some similarities in the syntax to the C programming language.
bc has the advantage of its wide availability and may be useful as a calculator
and in prototyping. Keith Matthews has written several bc programs in number
theory, http://www.numbertheory.org/gnubc/.
Calc http://www.gnu.org


Calc is an interactive calculator providing for easy large numeric calculations. 
It can also be programmed for difficult or long calculations. Functions are pro- 
vided for basic modular arithmetic. Calc, developed by David I. Bell and Landon 
Curt Noll with contributions, is hosted on SourceForge, http://sourceforge.net/ 
projects/calc/. 


GAP http://www.gap-system.org 


GAP (Groups, Algorithms and Programming) is a system for computational 
discrete algebra with particular emphasis on computational group theory. Ca- 
pabilities include long integer and rational arithmetic, cyclotomic fields, finite 
fields, residue class rings, p-adic numbers, polynomials, vectors and matrices, 
various combinatorial functions, elementary number theory, and a wide variety 
of list operations. GAP was developed at Lehrstuhl D für Mathematik, RWTH
Aachen, Germany beginning in 1986, and then transferred to the University of 
St. Andrews, Scotland in 1997. 


KANT/KASH http://www.math.tu-berlin.de/~kant/kash.html


The Computational Algebraic Number Theory package is designed for sophis- 
ticated computations in number fields and in global function fields. KASH is 
the KAnt SHell, a front-end to KANT. Development is directed by Prof. Dr. M. 
Pohst at the Technische Universität Berlin.


Magma http://magma.maths.usyd.edu.au 


The Magma Computational Algebra System “is a large, well-supported software 
package designed to solve computationally hard problems in algebra, number 
theory, geometry and combinatorics. It provides a mathematically rigorous en- 
vironment for computing with algebraic, number-theoretic, combinatoric and 
geometric objects.” In particular, there is extensive support for elliptic curve 
operations. 


Magma is produced and distributed by the Computational Algebra Group within 
the School of Mathematics and Statistics of the University of Sydney. “While 
Magma is a non-commercial system, we are required to recover all costs arising 
from its distribution and support.” 


Maple http://www.maplesoft.com 


Maple is an advanced mathematical problem-solving and programming en- 
vironment. The University of Waterloo’s Symbolic Computation Group (Wa- 
terloo, Canada) initially developed the Maple symbolic technology. Maple is 
commercial—historically, student and academic licensing has been relatively 
generous. 
Mathematica http://www.wolfram.com


Mathematica is a general-purpose technical computing system, combining fast, 
high-precision numeric and symbolic computation with easy-to-use data visu- 
alization and programming capabilities. Wolfram Research, the developer of 
Mathematica, was founded by Stephen Wolfram in 1987. 


MuPAD http://www.mupad.de 


MuPAD is a general-purpose computer algebra system for symbolic and numeri- 
cal computations. Users can view the library code, implement their own routines 
and data types easily, and can also dynamically link C/C++ compiled modules 
for raw speed and flexibility. 


MuPAD was originally developed by the MuPAD Research Group under di- 
rection of Prof. B. Fuchssteiner at the University of Paderborn (Germany). 
Free licenses are available; commercial versions can be obtained from SciFace 
Software. Several books on MuPAD have been published, including the paper- 
back MuPAD Tutorial: A version and platform independent introduction, by J. 
Gerhard, W. Oevel, F. Postel, and S. Wehmeier, Springer-Verlag, 2000. 





C.2 Libraries 


In contrast to most of the entries in §C.1, the packages in this section are more special- 
ized. For example, some are libraries intended for programmers using languages such 
as C or C++. 

The most basic is GNU MP, a library supporting arbitrary-precision arithmetic 
routines. It is recommended for its performance across many platforms. Crypto++ 
offers an extensive list of routines for cryptographic use, in an elegant C++ frame- 
work. OpenSSL, MIRACL, and cryptlib are similarly ambitious. Developed from 
SSLeay, OpenSSL is widely used in applications such as the Apache web server 
and OpenSSH, and has also been used strictly for its big number routines. MIRACL 
provides executables for elliptic curve point counting. 

In addition to integer and polynomial arithmetic, LiDIA and NTL provide sophis- 
ticated number-theoretic algorithms. Along with PARI-GP, these tools may be of 
particular interest to researchers. 


cryptlib http://www.cs.auckland.ac.nz/~pgut001/cryptlib/ 


Although elliptic curve methods are not included, the cryptlib security toolkit 
from Peter Gutmann is notable for its range of encryption, digital signature, key 
and certificate management, and message security services, with support for a 
wide variety of crypto hardware. In particular, cryptlib emphasizes ease of use 
of high-level services such as SSH, SSL, S/MIME, and PGP. The big number 
routines are from OpenSSL. The toolkit runs on a wide range of platforms, has a 
dual-license for open source and commercial use, and substantial documentation 
is available. 


Crypto++ http://www.eskimo.com/~weidai/cryptlib.html 


Crypto++ is a free C++ library from Wei Dai for cryptography, and includes 
ciphers, message authentication codes, one-way hash functions, public-key cryp- 
tosystems, and key agreement schemes. The project is hosted on SourceForge, 
http://sourceforge.net/projects/cryptopp/. 


GNU MP http://www.swox.com/gmp/ 


GMP is a free library for arbitrary precision arithmetic, operating on signed in- 
tegers, rational numbers, and floating point numbers. It focuses on speed rather 
than simplicity or elegance. 
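
To give a feel for the operations such a library supplies, the following Python
sketch uses the language's built-in arbitrary-precision integers and the standard
fractions module as a stand-in. It does not use GMP's C interface. The function
names and the choice of the 192-bit prime 2^192 - 2^64 - 1 as modulus are
assumptions made for illustration only.

```python
from fractions import Fraction

# Assumed example modulus: the 192-bit prime 2^192 - 2^64 - 1.
P192 = 2**192 - 2**64 - 1


def mod_mul(a: int, b: int, p: int = P192) -> int:
    """Multiply two integers of any size and reduce the product modulo p."""
    return (a * b) % p


def mod_inv(a: int, p: int = P192) -> int:
    """Invert a modulo the prime p using Fermat's little theorem: a^(p-2) mod p."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no inverse modulo p")
    return pow(a, p - 2, p)


def harmonic(n: int) -> Fraction:
    """Exact rational sum 1 + 1/2 + ... + 1/n, computed without rounding."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))
```

A dedicated library earns its place when operations like these dominate the
running time, which is typically the case for field arithmetic in public-key
schemes.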


Libgcrypt http://www.gnu.org/directory/security/libgcrypt.html 


Libgcrypt is a general-purpose cryptographic library based on the code from 
GnuPG (an OpenPGP compliant application). It provides functions for crypto- 
graphic building blocks including symmetric ciphers, hash algorithms, MACs, 
public key algorithms, large integers (using code derived from GNU MP), and 
random numbers. 


LiDIA http://www.informatik.tu-darmstadt.de/TI/LiDIA/ 


LiDIA is a C++ library for computational number theory which provides a col- 
lection of highly optimized implementations of various multiprecision data types 
and time-intensive algorithms. In particular, the library contains algorithms for 
factoring and for point counting on elliptic curves. The developer is the LiDIA 
Group at the Darmstadt University of Technology (Germany). 
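
What point counting computes can be illustrated by an exhaustive count over a
small prime field. The Python sketch below is not LiDIA's method or interface;
practical tools rely on asymptotically faster algorithms that remain feasible
at cryptographic field sizes, whereas this loop takes time proportional to p.
The function name and the example curve are assumptions chosen for illustration.

```python
def count_points(a: int, b: int, p: int) -> int:
    """Count the points on y^2 = x^3 + a*x + b over F_p, p an odd prime > 3.

    The count includes the point at infinity. Each x is tested with
    Euler's criterion, so the loop costs about p modular exponentiations.
    """
    if (4 * a**3 + 27 * b**2) % p == 0:
        raise ValueError("singular curve: discriminant is zero modulo p")
    total = 1  # the point at infinity
    for x in range(p):
        rhs = (x**3 + a * x + b) % p
        if rhs == 0:
            total += 1  # exactly one point, (x, 0)
        elif pow(rhs, (p - 1) // 2, p) == 1:
            total += 2  # rhs is a nonzero square: points (x, y) and (x, -y)
    return total
```

For instance, count_points(4, 20, 29) returns 37 for the curve
y^2 = x^3 + 4x + 20 over F_29.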


MIRACL http://indigo.ie/~mscott/ 


The Multiprecision Integer and Rational Arithmetic C/C++ Library implements 
primitives supporting symmetric-key and public-key methods, including elliptic 
curve methods and point counting. Licensed through Shamus Software Ltd. (Ireland), 
it is “FREE for non-profit making, educational, or any non-commercial use.” 


NTL: A Library for doing Number Theory http://www.shoup.net/ntl/ 


NTL is a high-performance portable C++ library providing data structures and 
algorithms for arbitrary length integers; for vectors, matrices, and polynomials 
over the integers and over finite fields; and for arbitrary precision floating point 
arithmetic. In particular, the library contains state-of-the-art implementations for 
lattice basis reduction. NTL is maintained by Victor Shoup. 
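
As a small illustration of arithmetic on polynomials over a finite field, the
Python sketch below multiplies and reduces polynomials over F_2, storing each
polynomial as an integer whose bit i is the coefficient of z^i. This is not
NTL's interface or implementation; the function names and the encoding are
assumptions made for the example.

```python
def gf2_poly_mul(a: int, b: int) -> int:
    """Multiply two polynomials over F_2 (bit i holds the coefficient of z^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition of coefficients in F_2 is XOR
        a <<= 1          # multiply a by z
        b >>= 1
    return result


def gf2_poly_mod(a: int, f: int) -> int:
    """Reduce the polynomial a modulo the nonzero polynomial f over F_2."""
    if f == 0:
        raise ZeroDivisionError("reduction polynomial must be nonzero")
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        # Cancel the leading term of a with a shifted copy of f.
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a
```

With the reduction polynomial z^4 + z + 1 (0b10011), the product of
z^3 + z^2 + 1 (0b1101) and z^2 + z + 1 (0b0111) reduces to z^2 + 1, so
gf2_poly_mod(gf2_poly_mul(0b1101, 0b0111), 0b10011) equals 0b0101.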


OpenSSL http://www.openssl.org 


The OpenSSL Project is a collaborative effort to develop a robust, full-featured, 
and Open Source toolkit implementing the Secure Sockets Layer (SSL v2/v3) 
and Transport Layer Security (TLS v1) protocols as well as a general-purpose 
cryptography library. OpenSSL is based on the SSLeay library developed by 
Eric A. Young and Tim J. Hudson. 


PARI-GP http://www.parigp-home.de 


PARI-GP is a computer-aided number theory package, consisting of a C library 
and the programmable interactive gp calculator. Originally developed at Bor- 
deaux by a team led by Henri Cohen, PARI-GP is now maintained by Karim 
Belabas at the Université Paris-Sud Orsay with many contributors. 


National Institute of Standards and Technology, see NIST 
NESSIE, 191, 270 
New European Schemes for Signatures, Integrity and Encryption, see NESSIE 
NIST, 269 
FIPS 180-2, 269 
FIPS 186, 10 
FIPS 186-2, 184, 257, 261, 269 
FIPS 197, 269 
FIPS 198, 269 
FIPS 46, 269 

prime, 44, 220 


Parallelized Pollard’s rho attack, 160 
Pentanomial, 54, 130, 258 
Pipelining, 226 
Pohlig-Hellman attack, 155 
Point, 13 
double, 79 
sum, 79 
Point at infinity, 13, 76 
Point counting algorithms, 179-180, 201 
Point halving, 129-141, 151 
halve-and-add, 137-141 
Point multiplication, 95-113 
binary NAF method, 99 
comparisons, 141-147 

fixed-base comb method, 106 

fixed-base NAF windowing method, 
105 

fixed-base windowing method, 104 

halve-and-add, 137-141 

interleaving, 111 

left-to-right binary method, 97 

Lim-Lee method, 108 

right-to-left binary method, 96 

sliding window method, 101 

timings, 146-147 

TNAF method, 119 

window NAF method, 100 

window TNAF method, 123 

with efficiently computable endomorphisms, 129 
Pollard’s rho attack, 17, 18, 157, 197 
Polynomial 


Karatsuba-Ofman multiplication, 51 
multiplication, 48 
reduction, 53 
squaring, 52 
Polynomial basis, 26 
Polynomial security, 203 
Polynomial-time algorithm, 16 
Power analysis, 239-244 
DPA, 242, 254 
SPA, 240, 254 
Power trace, 240 
Prime field, 26 
addition, 30 
arithmetic with SIMD, 214, 224, 250 
integer multiplication, 31 
integer squaring, 34 
inversion, 39 
Karatsuba-Ofman multiplication, 32, 
223 
reduction, 35 
subtraction, 30 
timings, 219-220, 223-224 
Prime-field-anomalous curve, 168, 198 
Primitive element, 63 
Program optimizations 
assembly coding, 217 
duplicated code, 216 


loop unrolling, 216 
Projective coordinates, see coordinates 
Projective point, 87 
PSEC, 191 
Public key validation, 180, 201 
Public-key cryptography, 4-5 
Public-key encryption, 188-192 

Cramer-Shoup, 204 

ECIES, 189, 203 

malleability, 189 

polynomial security, 203 

PSEC, 191 

security, 188 

semantic security, 203 
Public-key infrastructure, 5 


Q 
Quadratic number field, 22, 165 


Quantum computer, 196 
Qubit, 196 


R 

Rational points, 76 

RC4, 3 

Reduction 
Barrett, 36, 70, 220 
Montgomery, 38, 70 
polynomial, 27, 28 

RSA, 6-8 
basic encryption scheme, 6 
basic signature scheme, 7 
FDH, 248 
key pair generation, 6 
OAEP, 245, 256 
PSS, 249 

Running time, 16 


S 

Satoh’s algorithm, 180, 201 

Scalar multiplication, see point multiplication 

Schoof’s algorithm, 179, 201 

SEA algorithm, 179, 201 

SECG, 270 

Security level, 18 

Semantic security, 203 


Session key, 192 

SHA-1, 173 

Shamir’s trick, 109 

Side-channel attack, 238-250 
electromagnetic analysis, 244 
error message analysis, 244-248 
fault analysis, 248-249 
optical fault induction, 256 
power analysis, 239-244 
timing, 250 

Signature scheme 
EC-KCDSA, 186, 202 
ECDSA, 184, 202 
security, 183 

Signed digit representation, 98 

SIMD, 213, 224, 250 

Simple power analysis, see SPA 

Simultaneous inversion, 44 


Single-instruction multiple-data, see SIMD 


SKEME, 204 
SKIPJACK, 18 
Small subgroup attack, 181, 201 
SPA, 240, 254 
SSL, 182, 228, 250, 256 
see also OpenSSL 
SST algorithm, 180, 201 
Standards, 267-270 
ANSI, 267 
CRYPTREC, 191, 270 
FIPS, 269 
IEEE, 269 
ISO/IEC, 269 
NESSIE, 191, 270 
NIST, 269 
SECG, 270 
Standards for Efficient Cryptography Group (SECG), 270 
Station-to-station protocol, 193, 204 
STS, see station-to-station protocol 
Subexponential-time algorithm, 16 
Subfield, 28 
Superelliptic curve, 22 
Supersingular, 79, 83 
Symmetric-key cryptography, 3-4 
Tate pairing attack, 169, 198 
Throughput, 208 
Timing attack, 250, 256 
TNAF, 117 
Trace 

function, 130 

of an elliptic curve, 82 
Trinomial, 53, 54, 130, 258 
Triple-DES, 18 


U 
Underlying field, 77 
Unknown key-share resilience, 193 


V 
VLSI, 225 


W 
Weierstrass equation, 77 
Weil 
descent attack, 170, 199 
pairing attack, 169, 198 
Width-w 
NAF, 99 
TNAF, 120 


X 
Xedni calculus, 198 